Natural selection of protein structural and functional properties: a single nucleotide polymorphism perspective
© Liu et al.; licensee BioMed Central Ltd. 2008
Received: 20 March 2008
Accepted: 8 April 2008
Published: 08 April 2008
The rates of molecular evolution for protein-coding genes depend on the stringency of functional or structural constraints. The Ka/Ks ratio has been commonly used as an indicator of selective constraints and is typically calculated from interspecies alignments. Recent accumulation of single nucleotide polymorphism (SNP) data has enabled the derivation of Ka/Ks ratios for polymorphism (SNP A/S ratios).
Using data from the dbSNP database, we conducted the first large-scale survey of SNP A/S ratios for different structural and functional properties. We confirmed that the SNP A/S ratio is largely correlated with Ka/Ks for divergence. We observed stronger selective constraints for proteins that have high mRNA expression levels or broad expression patterns, have no paralogs, arose earlier in evolution, have natively disordered regions, are located in cytoplasm and nucleus, or are related to human diseases. On the residue level, we found higher degrees of variation for residues that are exposed to solvent, are in a loop conformation, natively disordered regions or low complexity regions, or are in the signal peptides of secreted proteins. Our analysis also revealed that histones and protein kinases are among the protein families that are under the strongest selective constraints, whereas olfactory and taste receptors are among the most variable groups.
Our study suggests that the SNP A/S ratio is a robust measure for selective constraints. The correlations between SNP A/S ratios and other variables provide valuable insights into the natural selection of various structural or functional properties, particularly for human-specific genes and constraints within the human lineage.
It is well established that there are tremendous variations in rates of evolution among protein-coding genes. A central problem in molecular evolution is to identify factors that determine the rate of protein evolution. One widely accepted principle is that a major force governing the rate of amino acid substitution is the stringency of functional or structural constraints. Proteins with rigorous functional or structural requirements are subject to strong purifying (negative) selective pressure, resulting in smaller numbers of amino acid changes. Therefore, these proteins tend to evolve slower than proteins with weaker constraints. A classic measure for selective pressure on protein-coding genes is the Ka/Ks ratio , that is, the ratio of non-synonymous (amino acid changing) substitutions per non-synonymous site to synonymous (silent) substitutions per synonymous site. The assumption is that synonymous sites are subject to only background nucleotide mutation, whereas non-synonymous sites are subject to both background mutation and amino acid selective pressure. Thus, the ratio of the observed non-synonymous mutation rate (Ka) to the synonymous mutation rate (Ks) can be utilized as an estimate of the selective pressure, where Ka/Ks « 1 suggests that most amino acid substitutions have been eliminated by selection, that is, strong purifying selection. Ka/Ks ratios for protein-coding genes are generally derived from inter-species sequence alignments and different evolution models have been developed to accurately estimate the ratios . There have been many studies using Ka/Ks ratios to measure evolutionary constraints among different classes of proteins. For example, it has been suggested that essential genes in bacteria evolve slower than non-essential genes , that house-keeping genes are under stronger selective constraints than tissue-specific genes , and that secreted proteins are under less purifying selection based on Ka/Ks ratios from human-mouse sequence alignments .
In the past few years, advances in sequencing technology have led to a rapid accumulation of DNA variation data for human populations, including copy number variations and single nucleotide polymorphisms (SNPs). Currently, the dbSNP database  at the National Center of Biotechnology Information (NCBI) catalogues about 12 million human SNPs, close to half of which are validated. It has also been shown by several independent sequencing studies that dbSNP has high coverage of frequent SNPs [7, 8]. The vast amount of SNP data can not only shed light on the variation in disease susceptibility and drug response among human populations, but also help us understand molecular evolution. In particular, these SNP data have provided us with another way of measuring evolutionary constraints, based on a prediction of the neutral theory of molecular evolution that A/S ratios should be highly correlated between intra-species polymorphism and inter-species divergence . In fact, SNP A/S ratios (also referred to as Ka/Ks ratios for polymorphisms) have been calculated to determine whether there is frequent positive selection on the human genome [10, 11], and they have been compared with Ka/Ks for human-chimpanzee divergence . However, it is not clear whether SNP A/S ratios are closely correlated with Ka/Ks in practice given the current volume of SNP data, and there have not been any large-scale studies of selective constraints on protein structural and functional properties using SNP data.
In the present study, we conducted a large-scale survey of SNP A/S ratios using SNP data from dbSNP. We first confirmed that the SNP A/S ratio is a good measure for selective pressure by showing its correlation with Ka/Ks from inter-species alignments and protein alignment conservation. We then obtained a variety of structural and functional properties from either database annotations or computational prediction methods and analyzed SNP A/S ratios for different classes of proteins and residues in an attempt to study the natural selection of these properties from the SNP perspective. Our comprehensive analysis provides: valuable insight into some features that have not been examined previously; independent confirmation of some previously established results; and additional data for areas where previous studies have had contradictory findings.
We collected 13,686 human genes that have at least one validated coding SNP according to dbSNP. The analysis was limited to validated SNPs to ensure data quality. Overall, 45,538 coding-region SNPs and 1,529,119 intronic SNPs were identified in these genes, corresponding to SNP densities of 2.0 and 2.4 SNPs, respectively, per 1,000 nucleotides. The number of non-synonymous coding SNPs per non-synonymous site (A) is 0.00123, the number of synonymous coding SNPs per synonymous site (S) is 0.00439, and the A/S ratio is 0.28. The values of A and S are both two times more than what have been reported in a small study , but the A/S ratio is similar.
SNP A/S ratio as a measure for selective constraints
We next investigated whether the conservation in protein sequences correlates with the SNP A/S ratio under the assumption that both the conservation at the protein sequence level and the SNP A/S ratio at the nucleotide level are indications for selective constraints. Using the position-specific alignment entropy (a measure for conservation) from PSI-BLAST profiles , we calculated A/S ratios for residues with different conservation scores. We indeed observed a monotonic decrease of the A/S ratio with an increase in protein sequence conservation (Figure 1b). The residues with the conservation range of 0-0.5 have a ratio of 0.33, while those having conservation scores bigger than 3.5 have an A/S ratio of 0.06.
SNP A/S ratios for protein features
Many studies have been published addressing the correlation between evolutionary constraints and other variables, most of which were based on relatively small data sets. Having established the SNP A/S ratio as a good measure for selective constraints, we attempted to use the large-scale human SNP data set to revisit some of the features in the earlier studies, and also to investigate several protein properties that had not been examined before.
Selective constraints and mRNA expression
SNP A/S ratio and evolutionary variables
To determine whether the SNP A/S ratio correlates with the age of proteins, we classified each protein into one of seven age groups according to their most ancient homologs. It appears that young proteins (for example, those found in human or primates only) have the highest A/S ratios (0.76 for human and 0.66 for primates), whereas proteins traceable to all animals or other eukaryotes have much lower ratios of about 0.25 (Figure 3b). This is consistent with a previous finding that proteins that arose earlier in evolution tend to have a larger proportion of sites subjected to negative selection , although there was some debate about whether the observation was an artifact resulting from the inability of BLAST to detect homology for the fastest-evolving genes [30, 31]. We examined the functions of proteins in each group by their Gene Ontology (GO)  annotation of biological process. The human-specific group is the least well annotated, with only 6% having GO annotation compared with 62% overall and 84% for proteins conserved in both eukaryotes and prokaryotes (the 'universal' group). Among the proteins with GO annotation of biological process, we observed the enrichment of 'epidermis development', 'defense response to bacterium', and 'spermatogenesis' in the human and primate groups, whereas 'amino acid metabolic process', 'glycolysis', and 'fatty acid metabolic process' are overrepresented in the 'universal' group.
SNP A/S ratios and sequence/structure variables
Solvent accessibility measures the degree of an amino acid residue's exposure to the surrounding solvent. There have been a number of studies about the effect of mutations on solvent accessibility and its implication in human diseases; most of them were based on relatively small collections of SNPs in known protein structures. The general consensus was that buried residues are less likely to vary and their mutations are more likely to cause disease [38, 39]. We obtained solvent accessibility predictions for all proteins in our dataset using PROFacc , and compared the SNP A/S ratios. Exposed residues have an A/S ratio of 0.31, significantly higher than that of 0.24 for the buried residues (Figure 4b). The p-value for this difference is smaller than 1e-04 according to bootstrap analysis. Similar results were obtained when using three-state prediction (buried, intermediate, and exposed) or numeric relative accessibility values. This underscores higher selective constraints on buried residues, possibly due to their importance in maintaining protein stability.
We also investigated selective constraints upon different protein structure conformations. We first grouped all residues into different secondary structure conformations (alpha-helix, beta-strand, or loop) according to predictions by PSIPRED . Significantly higher A/S ratios were observed for residues in the loop conformation (Figure 4c), suggesting relaxed selective pressure on these residues. There is no difference between residues in alpha-helices and beta-strands. We next examined natively disordered proteins, a class of structurally flexible proteins that have recently gained traction because of their potential important roles in dynamic molecular recognition of macromolecules . It has been estimated that one-third of eukaryotic proteins contains disordered regions , and that they are more likely to be involved in regulatory functions and protein-protein interactions [44, 45]. We obtained disorder predictions using DISOPRED2  and retained only the disordered regions longer than 30 residues. Interestingly, while proteins with disordered regions have a lower A/S ratio (Figure 4d; Figure S2b in Additional data file 1), the residues in disordered regions have a much higher A/S ratio than other residues (0.38 versus 0.22; Figure 4d). This seems to suggest that disordered proteins as a class are under stronger selective pressure, but the disordered residues are allowed to evolve much faster to explore different ways to interact with other molecules. Since disordered regions are often characterized by low sequence complexity [42, 44], we also examined the selective constraints on low complexity regions as defined by SEG . Not surprisingly, low complexity regions have a higher A/S ratio, but the profile is different from that of the disordered regions (Figure 4e), confirming that disorder and low complexity are related but different sequence features.
SNP A/S ratios and protein subcellular localization
Selective constraints on functional classes and protein families
GO biological process categories with the highest and lowest SNP A/S ratios
Number of proteins
Sensory perception of smell
Response to stimulus
G-protein coupled receptor protein signaling pathway
Innate immune response
Fatty acid biosynthetic process
Response to unfolded protein
Regulation of translation
Calcium ion transport
Chromatin assembly or disassembly
Nuclear mRNA splicing, via spliceosome
Protein kinase C activation
Pfam families with the highest and lowest SNP A/S ratios
Number of proteins
Mammalian taste receptor protein (TAS2R)
7 transmembrane receptor (rhodopsin family)
Glutathione S-transferase, amino-terminal domain
Glutathione S-transferase, carboxy-terminal domain
Putative regulatory protein (CxxC_CxxC_SSSS)
Lectin C-type domain
alpha/beta Hydrolase fold
Small cytokines (intecrine/chemokine), interleukin-8 like
Zinc finger, C4 type (two domains)
SAM domain (Sterile alpha motif)
Core histone H2A/H2B/H3/H4
Glycosyl transferase family 2
BTB And C-terminal Kelch
Protein kinase C terminal domain
Selective pressure on disease-related proteins
Selective constraints and protein-protein interaction
SNP A/S ratio and splicing
Measuring selective constraints with SNP A/S ratios
The average SNP density in our data set is about 2 SNPs per 1,000 nucleotides. Although we have limited our analysis to proteins with at least one validated coding SNP, many proteins in our set still have no non-synonymous SNPs or synonymous SNPs. Therefore, it is neither practical nor reliable to measure the selective constraints on individual proteins using the SNP A/S ratio. Nevertheless, we have demonstrated that when a group of proteins (or residues) are measured together, the measure can be quite robust and often in good agreement with Ka/Ks for divergence. For example, the SNP A/S ratio for proteins with paralogs in our data set is 0.47, very close to the Ka/Ks ratio of 0.45 for duplicated mammalian genes reported earlier . Although this may not seem surprising given that it is an expected prediction from the neutral theory of evolution, our analysis showed that the SNP A/S ratio can indeed be a practical and robust measure given the current volume of SNP data in public databases.
There are two unique advantages of using SNP A/S ratios to measure selective constraints. First, for human-specific genes and genes whose ortholog relationships can not be determined reliably, inter-species Ka/Ks ratios can not be estimated. For example, among the 13,686 genes in our data set, human-mouse Ka/Ks ratios for 3,937 genes are not available through Ensemble. In those cases, SNP A/S ratios provide an alternative way of measuring selective pressure by the same evolutionary principle. Second, the SNP A/S ratio is a direct measure of the selective constraints specific to the human lineage, which can not be obtained from Ka/Ks for species divergence.
In comparison with other simpler measures used in some earlier SNP studies (for example, ), such as density of synonymous and non-synonymous SNPs or the fraction of SNPs in different protein classes, the SNP A/S ratio offers several advantages: its interpretation is clearer from evolutionary theories; it is not subject to data selection bias arising from the popularity of genes in sequencing efforts, which could be a problem when using SNP density; and it is normalized by the number of synonymous and non-synonymous sites, so the numbers are comparable across different protein classes or different studies.
Correlation between SNP distribution and structural/functional constraints
To address the issue of determinants of molecular evolution, many studies have been published examining the correlation between variables that characterize the evolution, expression and function of genes. However, as noted in the Results section, there are many controversies among those reports. It is clear that the validity and significance of correlations depend on many factors, including feature variables, data sets, and the correlation measure. Here, we provide the first large-scale study of SNP A/S ratios, and reported correlations between the ratio and a number of variables; not surprisingly, some are not in agreement with previous studies. Although further studies are still needed to draw definitive conclusions about the major determinants, we agree with many others that it is likely that many of these variables correlate with each other in some way, and some of them are secondary effects rather than primary determinants . Several statistical methods, such as partial correlation and principle component regression, have been used to attempt to dissect these complex and rich connections; yet it remains an important open challenge in molecular evolution.
Molecular evolution is a dynamic process, with strong implications for both differences between species and variations within a given species. SNPs in the human genome perhaps capture a glimpse of this dynamic process and, therefore, could offer key insights missed by conventional cross-species comparisons. Here, we first established the SNP A/S ratio as a reliable metric for studying the selective constraints for molecular evolution, and then used this metric to systematically investigate a large number of protein features that contribute to differences in molecular evolution rate. Our study provided the first such large-scale survey based on SNP A/S ratios, leading to novel insights into features that have not been examined before and clarification of findings that were contradictory in the literature.
Materials and methods
Source of sequences and SNP data
We first filtered the human entries in the NCBI's Entrez Gene database  by the presence of RefSeq  records, and limited our analysis to genes that have at least one validated coding SNP according to dbSNP  build 127. For each gene, we chose one transcript (and its corresponding protein product) by selecting the transcript with the most validated coding SNPs, most advanced RefSeq annotation status, and the longest sequence. We also discarded proteins that are longer than 5,000 residues or shorter than 25 residues. Our final data set has 13,686 proteins with 45,538 coding SNPs.
mRNA expression data
where n = 54 is the number of human tissues examined here, S j is the expression level in each tissue, and Smax is the highest expression level of the probe set across all tissues. When a gene has multiple probe sets, its expression levels and tissue specificity were represented by the probe set with the highest mean expression level.
Affymetrix present/absent calls were used to calculate the breadth of the expression. A probe set was considered 'present' in a tissue if it had 'present' calls in no less than half of the samples in that tissue, and the expression breadth of a probe set was defined as the number of tissues in which the probe set was 'present'. When a gene has multiple probe sets, its expression breadth was represented by the probe set with the highest value of breadth.
Structural and functional features
Ka/Ks ratios from human-mouse alignments were downloaded from Ensembl . The information about human paralogs was extracted from HomoloGene database [27, 28] release 57; the existence of paralogs is indicated by the presence of other human proteins in the same homology group. To investigate the degree of conservation throughout the evolutionary history, that is, the age of a protein, we performed BLAST searches  of each protein sequence against NCBI's RefSeq database, and collected all hits with an e-value < 1e-10 and a protein length difference smaller than 30% as potential homologs. The query protein was then classified into one of the seven age groups (human, primate, mammal, vertebrate, animal, eukaryote, or universal) according to its most ancient homolog. The use of different e-values and length difference cutoffs did not change the results qualitatively. We also obtained the conservation score for each residue in all proteins in our data set by running PSI-BLAST  against NCBI's nr database (parameters: '-j 3 -h 5e-3 -F F') and taking the values in the 'information per position' (a measure of alignment entropy) column from the ASCII format of PSI-BLAST profiles.
We obtained protein structure features by the following computational methods using their default parameters: two-state (exposed or buried) solvent accessibility by PROFacc , signal peptides by SignalP version 3.0 , transmembrane helices by TMHMM 2.0 , secondary structures by PSIPRED , sequence complexity by SEG , and natively disordered proteins by Disopred2 . Disopred2 predictions were subsequently filtered to retain only the disordered regions longer than 30 residues.
GO annotations  were extracted from the GenBank record of each gene, and the GO terms were subsequently mapped to a selection of high-level terms according to GOA slim [52, 69]. For cellular component annotations, the GOA slim terms were further collapsed into four categories (extracellular region, nucleus, cytoplasm, and membrane) where appropriate. Proteins assigned to more than one of these four categories were excluded from the analysis of subcellular localization by GO annotation. Subcellular localization prediction for non-membrane proteins (that is, no transmembrane helix predictions from TMHMM) were also obtained by using LOCtree . To assign proteins to protein families, we ran hmmpfam from the HMMER package (version 2.3) against Pfam_ls  models (release 21.0) and obtained Pfam family hits with an e-value < 0.01. Numbers of protein-protein interaction partners were obtained from the IntAct database [64, 70] (28 September 2007 release). We also obtained disease related genes from CGC (243 genes) [57, 71], COSMIC [58, 72] (3,103 genes), and OMIM databases [27, 73] on 28 September 2007. For OMIM, only those genes with the 'confirmed' status (2,334 genes) were included in the analysis.
Sequences within 70 bp of an exon-intron junction were collected in the same way as described in . Briefly, transcripts with less than three exons were excluded from the analysis. All internal exons were trimmed so that the first base was the first base of the first complete codon, and the last base the last of the final complete codon. The first and last codons were then removed from each exon, and remaining codons within 70 bp of the intron-exon boundary were defined as sequences close to the boundary.
The features we considered can also be divided into two classes: protein-level features and residue-level features. A protein-level feature describes the property of an entire protein, for example, whether a protein has a transmembrane helix or not; in contrast, a residue-level feature describes the property of a subset of residues within a protein, for example, whether a residue resides in the transmembrane helix or not.
The SNP A/S ratio, also known as the Ka/Ks ratio for polymorphism, is defined as the ratio of the number of non-synonymous SNPs per non-synonymous site to the number of synonymous SNPs per synonymous site. The numbers of synonymous sites and non-synonymous sites were calculated using the method of Miyata and Yasunaga . The ratio for a set of proteins (or residues) was calculated by summing the number of SNPs and the number of sites to obtain A and S for the concatenated set before taking the ratio.
We performed bootstrap re-sampling analysis to assess the statistical significance of the differences observed in A/S ratios between different groups. We obtained 10,000 bootstrap replicates by re-sampling with replacement from the original data set. A/S ratios for different groups were calculated for each replicate, and confidence intervals and p-values of the differences were obtained from those 10,000 sets of A/S ratios. For protein-level feature groups, re-samplings were performed within each group, so that those re-sampled data sets had the same number of proteins in each group as the original data set had. For residue-level feature groups, the re-sampled data sets were constructed by re-sampling the 13,686 proteins in our entire data set.
Monte Carlo samplings were used to assess the differences in A/S ratios between groups when controlling for expression parameters. Briefly, the distribution of expression breadth (or expression level) for the group with the smallest number of genes was set as the target distribution; genes in all other groups were then sampled without replacement using Monte Carlo simulation so that all groups had the same distribution of the expression breadth (or level). The effectiveness of Monte Carlo samplings was confirmed by Kolmogorov-Smirnov tests. An example is shown in Figure S1 in Additional data file 1. For each group, 100 samplings were performed, and the A/S ratio for the group was taken as the mean of A/S ratios from the 100 Monte Carlo samples.
The random grouping was performed 100 times. The correlation and partial correlation coefficients were computed from these 100 samples, and the medians of those 100 sets of coefficients were reported.
Additional data files
The following additional data are available. Additional data file 1 includes Table S1 and Figures S1-S3.
Cancer Gene Census
Catalogue of Somatic Mutations in Cancer
G protein-coupled receptor
non-synonymous substitutions per non-synonymous site
synonymous substitutions per synonymous site
Online Mendelian Inheritance in Man
single nucleotide polymorphism.
We thank Kiran Mukhyala and Reece Hart for technical assistance, Joshua Kaminker, Peter Haverty, Shiuh-Ming Luoh, Peng Yue, and Colin Watanabe for helpful discussions, and Burkhard Rost and Rajesh Nair (Columbia University) for providing LOCtree and PROFacc predictions.
- Hurst LD: The Ka/Ks ratio: diagnosing the form of sequence evolution. Trends Genet. 2002, 18: 486-10.1016/S0168-9525(02)02722-1.PubMedView ArticleGoogle Scholar
- Li W-H: Molecular Evolution. 1997, Sunderland, Massachusetts: Sinauer Associates, Inc.Google Scholar
- Jordan IK, Rogozin IB, Wolf YI, Koonin EV: Essential genes are more evolutionarily conserved than are nonessential genes in bacteria. Genome Res. 2002, 12: 962-968. 10.1101/gr.87702. Article published online before print in May 2002.PubMedPubMed CentralView ArticleGoogle Scholar
- Zhang L, Li WH: Mammalian housekeeping genes evolve more slowly than tissue-specific genes. Mol Biol Evol. 2004, 21: 236-239. 10.1093/molbev/msh010.PubMedView ArticleGoogle Scholar
- Mouse Genome Sequencing Consortium, Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, Antonarakis SE, Attwood J, Baertsch R, Bailey J, Barlow K, Beck S, Berry E, Birren B, Bloom T, Bork P, Botcherby M, Bray N, Brent MR, Brown DG, Brown SD, Bult C, Burton J, Butler J, Campbell RD, Carninci P, et al: Initial sequencing and comparative analysis of the mouse genome. Nature. 2002, 420: 520-562. 10.1038/nature01262.View ArticleGoogle Scholar
- Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K: dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001, 29: 308-311. 10.1093/nar/29.1.308.PubMedPubMed CentralView ArticleGoogle Scholar
- Jiang R, Duan J, Windemuth A, Stephens JC, Judson R, Xu C: Genome-wide evaluation of the public SNP databases. Pharmacogenomics. 2003, 4: 779-789. 10.1517/phgs.4.6.779.22821.PubMedView ArticleGoogle Scholar
- Freudenberg-Hua Y, Freudenberg J, Winantea J, Kluck N, Cichon S, Bruss M, Propping P, Nöthen MM: Systematic investigation of genetic variability in 111 human genes-implications for studying variable drug response. Pharmacogenomics J. 2005, 5: 183-192. 10.1038/sj.tpj.6500306.PubMedView ArticleGoogle Scholar
- Kimura M: The neutral theory of molecular evolution. Evolution of Genes and Proteins. Edited by: Nei M, Koehn RK. 1983, Sunderland, Massachusetts: Sinauer Associates, Inc., 208-233.Google Scholar
- Fay JC, Wyckoff GJ, Wu CI: Positive and negative selection on the human genome. Genetics. 2001, 158: 1227-1234.PubMedPubMed CentralGoogle Scholar
- Zhang L, Li WH: Human SNPs reveal no evidence of frequent positive selection. Mol Biol Evol. 2005, 22: 2504-2507. 10.1093/molbev/msi240.PubMedView ArticleGoogle Scholar
- Chimpanzee Sequencing and Analysis Consortium: Initial sequence of the chimpanzee genome and comparison with the human genome. Nature. 2005, 437: 69-87. 10.1038/nature04072.View ArticleGoogle Scholar
- Ensembl. [http://www.ensembl.org]
- Gibbons JD: Nonparametric Measures of Association. 1993, Newbury Park: Sage PublicationsGoogle Scholar
- Altschul SF, Madden TL, Shäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped Blast and PSI-Blast: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.PubMedPubMed CentralView ArticleGoogle Scholar
- Rocha EP, Danchin A: An analysis of determinants of amino acids substitution rates in bacterial proteins. Mol Biol Evol. 2004, 21: 108-116. 10.1093/molbev/msh004.PubMedView ArticleGoogle Scholar
- Pál C, Papp B, Hurst LD: Highly expressed genes in yeast evolve slowly. Genetics. 2001, 158: 927-931.PubMedPubMed CentralGoogle Scholar
- Subramanian S, Kumar S: Gene expression intensity shapes evolutionary rates of the proteins encoded by the vertebrate genome. Genetics. 2004, 168: 373-381. 10.1534/genetics.104.028944.PubMedPubMed CentralView ArticleGoogle Scholar
- Drummond DA, Raval A, Wilke CO: A single determinant dominates the rate of yeast protein evolution. Mol Biol Evol. 2006, 23: 327-337. 10.1093/molbev/msj038.PubMedView ArticleGoogle Scholar
- Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH: Why highly expressed proteins evolve slowly. Proc Natl Acad Sci USA. 2005, 102: 14338-14343. 10.1073/pnas.0504070102.PubMedPubMed CentralView ArticleGoogle Scholar
- Plotkin JB, Fraser HB: Assessing the determinants of evolutionary rates in the presence of noise. Mol Biol Evol. 2007, 24: 1113-1121. 10.1093/molbev/msm044.PubMedView ArticleGoogle Scholar
- Duret L, Mouchiroud D: Determinants of substitution rates in mammalian genes: expression pattern affects selection intensity but not mutation rate. Mol Biol Evol. 2000, 17: 68-74.PubMedView ArticleGoogle Scholar
- Lercher MJ, Chamary JV, Hurst LD: Genomic regionality in rates of evolution is not explained by clustering of genes of comparable expression profile. Genome Res. 2004, 14: 1002-1013. 10.1101/gr.1597404.PubMedPubMed CentralView ArticleGoogle Scholar
- Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, Cooke MP, Walker JR, Hogenesch JB: A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci USA. 2004, 101: 6062-6067. 10.1073/pnas.0400782101.PubMedPubMed CentralView ArticleGoogle Scholar
- Kondrashov FA, Rogozin IB, Wolf YI, Koonin EV: Selection in the evolution of gene duplications. Genome Biol. 2002, 3: research0008.1-0008.9. 10.1186/gb-2002-3-2-research0008.View ArticleGoogle Scholar
- Lynch M, Conery JS: The evolutionary fate and consequences of duplicate genes. Science. 2000, 290: 1151-1155. 10.1126/science.290.5494.1151.PubMedView ArticleGoogle Scholar
- Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, Geer LY, Kapustin Y, Khovayko O, Landsman D, Lipman DJ, Madden TL, Maglott DR, Ostell J, Miller V, Pruitt KD, Schuler GD, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Tatusov RL, Tatusova TA, Wagner L, Yaschenko E: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2007, 35 (Database issue): D5-D12. 10.1093/nar/gkl1031.PubMedPubMed CentralView ArticleGoogle Scholar
- HomoloGene. [http://www.ncbi.nlm.nih.gov/sites/entrez?db=homologene]
- Albà MM, Castresana J: Inverse relationship between evolutionary rate and age of mammalian genes. Mol Biol Evol. 2005, 22: 598-606. 10.1093/molbev/msi045.PubMedView ArticleGoogle Scholar
- Albà MM, Castresana J: On homology searches by protein Blast and the characterization of the age of genes. BMC Evol Biol. 2007, 7: 53-10.1186/1471-2148-7-53.PubMedPubMed CentralView ArticleGoogle Scholar
- Elhaik E, Sabath N, Graur D: The "inverse relationship between evolutionary rate and age of mammalian genes" is an artifact of increased genetic distance with rate of evolution and time of divergence. Mol Biol Evol. 2006, 23: 1-3. 10.1093/molbev/msj006.PubMedView ArticleGoogle Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-29. 10.1038/75556.PubMedPubMed CentralView ArticleGoogle Scholar
- Lemos B, Bettencourt BR, Meiklejohn CD, Hartl DL: Evolution of proteins and gene expression levels are coupled in Drosophila and are independently associated with mRNA abundance, protein length, and number of protein-protein interactions. Mol Biol Evol. 2005, 22: 1345-1354. 10.1093/molbev/msi122.PubMedView ArticleGoogle Scholar
- Larracuente AM, Sackton TB, Greenberg AJ, Wong A, Singh ND, Sturgill D, Zhang Y, Oliver B, Clark AG: Evolution of protein-coding genes in Drosophila. Trends Genet. 2008, 24: 114-123. 10.1016/j.tig.2007.12.001.PubMedView ArticleGoogle Scholar
- Zhang J: Protein-length distributions for the three domains of life. Trends Genet. 2000, 16: 107-109. 10.1016/S0168-9525(99)01922-8.PubMedView ArticleGoogle Scholar
- Liao BY, Scott NM, Zhang J: Impacts of gene essentiality, expression pattern, and gene compactness on the evolutionary rate of mammalian proteins. Mol Biol Evol. 2006, 23: 2072-2080. 10.1093/molbev/msl076.PubMedView ArticleGoogle Scholar
- Lipman DJ, Souvorov A, Koonin EV, Panchenko AR, Tatusova TA: The relationship of protein conservation and sequence length. BMC Evol Biol. 2002, 2: 20-10.1186/1471-2148-2-20.PubMedPubMed CentralView ArticleGoogle Scholar
- Ferrer-Costa C, Orozco M, de la Cruz X: Characterization of disease-associated single amino acid polymorphisms in terms of sequence and structure properties. J Mol Biol. 2002, 315: 771-786. 10.1006/jmbi.2001.5255.PubMedView ArticleGoogle Scholar
- Wang Z, Moult J: SNPs, protein structure, and disease. Hum Mutat. 2001, 17: 263-270. 10.1002/humu.22.PubMedView ArticleGoogle Scholar
- Rost B: How to use protein 1D structure predicted by PROFphd. The Proteomics Protocols Handbook. Edited by: Walker JE. 2005, Totowa NJ: Humana, 875-901.View ArticleGoogle Scholar
- Jones DT: Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol. 1999, 292: 195-202. 10.1006/jmbi.1999.3091.PubMedView ArticleGoogle Scholar
- Dunker AK, Lawson JD, Brown CJ, Williams RM, Romero P, Oh JS, Oldfield CJ, Campen AM, Ratliff CM, Hipps KW, Ausio J, Nissen MS, Reeves R, Kang C, Kissinger CR, Bailey RW, Griswold MD, Chiu W, Garner EC, Obradovic Z: Intrinsically disordered protein. J Mol Graph Model. 2001, 19: 26-59. 10.1016/S1093-3263(00)00138-8.PubMedView ArticleGoogle Scholar
- Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT: Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol. 2004, 337: 635-645. 10.1016/j.jmb.2004.02.002.PubMedView ArticleGoogle Scholar
- Liu J, Tan H, Rost B: Loopy proteins appear conserved in evolution. J Mol Biol. 2002, 322: 53-64. 10.1016/S0022-2836(02)00736-2.PubMedView ArticleGoogle Scholar
- Dunker AK, Cortese MS, Romero P, Iakoucheva LM, Uversky VN: Flexible nets. The roles of intrinsic disorder in protein interaction networks. FEBS J. 2005, 272: 5129-5148. 10.1111/j.1742-4658.2005.04948.x.PubMedView ArticleGoogle Scholar
- Wootton JC, Federhen S: Analysis of compositionally biased regions in sequence databases. Methods Enzymol. 1996, 266: 554-571.PubMedView ArticleGoogle Scholar
- Ramensky V, Bork P, Sunyaev S: Human non-synonymous SNPs: server and survey. Nucleic Acids Res. 2002, 30: 3894-3900. 10.1093/nar/gkf493.PubMedPubMed CentralView ArticleGoogle Scholar
- Julenius K, Pedersen AG: Protein evolution is faster outside the cell. Mol Biol Evol. 2006, 23: 2039-2048. 10.1093/molbev/msl081.PubMedView ArticleGoogle Scholar
- Bendtsen JD, Nielsen H, von Heijne G, Brunak S: Improved prediction of signal peptides: SignalP 3.0. J Mol Biol. 2004, 340: 783-795. 10.1016/j.jmb.2004.05.028.PubMedView ArticleGoogle Scholar
- Krogh A, Larsson B, von Heijne G, Sonnhammer EL: Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001, 305: 567-580. 10.1006/jmbi.2000.4315.PubMedView ArticleGoogle Scholar
- Nair R, Rost B: Mimicking cellular sorting improves prediction of subcellular localization. J Mol Biol. 2005, 348: 85-100. 10.1016/j.jmb.2005.02.025.PubMedView ArticleGoogle Scholar
- Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R: The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res. 2004, 32 (Database issue): D262-D266. 10.1093/nar/gkh021.PubMedPubMed CentralView ArticleGoogle Scholar
- Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, Eddy SR, Sonnhammer EL, Bateman A: Pfam: clans, web tools and services. Nucleic Acids Res. 2006, 34 (Database issue): D247-D251. 10.1093/nar/gkj149.PubMedPubMed CentralView ArticleGoogle Scholar
- Smith NG, Eyre-Walker A: Human disease genes: patterns and predictions. Gene. 2003, 318: 169-175. 10.1016/S0378-1119(03)00772-8.PubMedView ArticleGoogle Scholar
- Huang H, Winter EE, Wang H, Weinstock KG, Xing H, Goodstadt L, Stenson PD, Cooper DN, Smith D, Alba MM, Ponting CP, Fechtel K: Evolutionary conservation and selection of human disease gene orthologs in the rat and mouse genomes. Genome Biol. 2004, 5: R47-10.1186/gb-2004-5-7-r47.PubMedPubMed CentralView ArticleGoogle Scholar
- Kondrashov FA, Ogurtsov AY, Kondrashov AS: Bioinformatical assay of human gene morbidity. Nucleic Acids Res. 2004, 32: 1731-1737. 10.1093/nar/gkh330.PubMedPubMed CentralView ArticleGoogle Scholar
- Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, Rahman N, Stratton MR: A census of human cancer genes. Nat Rev Cancer. 2004, 4: 177-183. 10.1038/nrc1299.PubMedPubMed CentralView ArticleGoogle Scholar
- Forbes S, Clements J, Dawson E, Bamford S, Webb T, Dogan A, Flanagan A, Teague J, Wooster R, Futreal PA, Stratton MR: COSMIC 2005. Br J Cancer. 2006, 94: 318-322. 10.1038/sj.bjc.6602928.PubMedPubMed CentralView ArticleGoogle Scholar
- Tu Z, Wang L, Xu M, Zhou X, Chen T, Sun F: Further understanding human disease genes by comparing with housekeeping genes and other genes. BMC Genomics. 2006, 7: 31-10.1186/1471-2164-7-31.PubMedPubMed CentralView ArticleGoogle Scholar
- Fraser HB, Hirsh AE, Steinmetz LM, Scharfe C, Feldman MW: Evolutionary rate in the protein interaction network. Science. 2002, 296: 750-752. 10.1126/science.1068696.PubMedView ArticleGoogle Scholar
- Jordan IK, Wolf YI, Koonin EV: No simple dependence between protein evolution rate and the number of protein-protein interactions: only the most prolific interactors tend to evolve slowly. BMC Evol Biol. 2003, 3: 1-10.1186/1471-2148-3-1.PubMedPubMed CentralView ArticleGoogle Scholar
- Bloom JD, Adami C: Apparent dependence of protein evolutionary rate on number of interactions is linked to biases in protein-protein interactions data sets. BMC Evol Biol. 2003, 3: 21-10.1186/1471-2148-3-21.PubMedPubMed CentralView ArticleGoogle Scholar
- Batada NN, Hurst LD, Tyers M: Evolutionary and physiological importance of hub proteins. PLoS Comput Biol. 2006, 2: e88-10.1371/journal.pcbi.0020088.PubMedPubMed CentralView ArticleGoogle Scholar
- Kerrien S, Alam-Faruque Y, Aranda B, Bancarz I, Bridge A, Derow C, Dimmer E, Feuermann M, Friedrichsen A, Huntley R, Kohler C, Khadake J, Leroy C, Liban A, Lieftink C, Montecchi-Palazzi L, Orchard S, Risse J, Robbe K, Roechert B, Thorneycroft D, Zhang Y, Apweiler R, Hermjakob H: IntAct - open source resource for molecular interaction data. Nucleic Acids Res. 2007, 35 (Database issue): D561-D565. 10.1093/nar/gkl958.PubMedPubMed CentralView ArticleGoogle Scholar
- Parmley JL, Urrutia AO, Potrzebowski L, Kaessmann H, Hurst LD: Splicing and the evolution of proteins in mammals. PLoS Biol. 2007, 5: e14-10.1371/journal.pbio.0050014.PubMedPubMed CentralView ArticleGoogle Scholar
- Koonin EV, Wolf YI: Evolutionary systems biology: links between gene evolution and function. Curr Opin Biotechnol. 2006, 17: 481-487. 10.1016/j.copbio.2006.08.003.PubMedView ArticleGoogle Scholar
- Gene Expression Atlas. [http://wombat.gnf.org/index.html]
- Yanai I, Benjamin H, Shmoish M, Chalifa-Caspi V, Shklar M, Ophir R, Bar-Even A, Horn-Saban S, Safran M, Domany E, Lancet D, Shmueli O: Genome-wide midrange transcription profiles reveal expression level relationships in human tissue specification. Bioinformatics. 2005, 21: 650-659. 10.1093/bioinformatics/bti042.PubMedView ArticleGoogle Scholar
- GOA slim. [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/goslim/goaslim.map]
- IntAct. [ftp://ftp.ebi.ac.uk/pub/databases/intact/]
- Cancer Gene Census. [http://www.sanger.ac.uk/genetics/CGP/Census/]
- Catalogue Of Somatic Mutations In Cancer. [http://www.sanger.ac.uk/genetics/CGP/cosmic/]
- OMIM. [http://www.ncbi.nlm.nih.gov/sites/entrez?db=OMIM]
- Miyata T, Yasunaga T: Molecular evolution of mRNA: a method for estimating evolutionary rates of synonymous and amino acid substitutions from homologous nucleotide sequences and its application. J Mol Evol. 1980, 16: 23-36. 10.1007/BF01732067.PubMedView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.