Nonsynonymous mutations in the coding regions of human genes are responsible for phenotypic differences between humans and for susceptibility to genetic disease. Computational methods were recently used to predict deleterious effects of nonsynonymous human mutations and polymorphisms. Here we focus on understanding the amino-acid mutation spectrum of human genetic disease. We compare the disease spectrum to the spectra of mutual amino-acid mutation frequencies, non-disease polymorphisms in human genes, and substitutions fixed between species.
We find that the disease spectrum correlates well with the amino-acid mutation frequencies based on the genetic code. Normalized by the mutation frequencies, the spectrum can be rationalized in terms of chemical similarities between amino acids. The disease spectrum is almost identical for membrane and non-membrane proteins. Mutations at arginine and glycine residues are together responsible for about 30% of genetic diseases, whereas random mutations at tryptophan and cysteine have the highest probability of causing disease.
The overall disease spectrum mainly reflects the mutability of the genetic code. We corroborate earlier results that the probability of a nonsynonymous mutation causing a genetic disease increases monotonically with an increase in the degree of evolutionary conservation of the mutation site and a decrease in the solvent-accessibility of the site; opposite trends are observed for non-disease polymorphisms. We estimate that the rate of nonsynonymous mutations with a negative impact on human health is less than one per diploid genome per generation.
Several recent studies [1–6] have applied computational methods to predict potentially deleterious effects of nonsynonymous single-nucleotide polymorphisms (SNPs) in humans. SNPs represent common human alleles, usually with population frequencies greater than 1%. Both structural and evolutionary methods were used to assess potential functional effects of SNPs. It was predicted that a substantial fraction (10-30%) of human SNPs may affect protein function negatively, although the medical consequences of these SNPs remain to be established.
The main goal of the work reported here is to characterize and rationalize the overall amino-acid spectrum of disease mutations and non-disease SNPs (referred to as 'benign SNPs' below). We obtain the relative probabilities that a random mutation (rather than an existing SNP) will cause a genetic disease while explicitly taking into account the underlying spectrum of nucleotide mutations. Such an approach will allow, in the future, the identification and characterization of highly mutable sites in the human genome which are also functionally important.
Miller and Kumar  performed a detailed analysis of the disease mutations and benign SNP spectra in seven human genes. While some of our results are consistent with their study, we find major differences. For example, we observe a significantly larger contribution of mutations at arginine (Arg) and glycine (Gly) to human genetic disease. We attribute the differences to the substantially larger gene set (436 genes versus 7) used in our analysis.
Results and discussion
Overall amino-acid mutational spectrum
We present the amino-acid spectra of disease mutations and polymorphisms in Figure 1. The mutations from the Mendelian Inheritance in Man (MIM) database  annotated in SWISS-PROT  were used as a source of human disease mutations. In total, 4,236 mutations from 436 genes were considered. The collection of 1,037 synonymous and nonsynonymous SNPs from the extensive analysis of haplotypes in 313 human genes  was used as a source of benign SNPs. There was no overlap between the disease mutations and benign sets of SNPs used in the study. The spectrum of interspecies substitutions (Figure 1d) was calculated on the basis of the PAM1 matrix  as described in Materials and methods.
Nearly all mutations in the current MIM database represent Mendelian disease (monogenic in etiology). It remains to be seen to what extent our results pertain to disease mutations involved in polygenic disorders. At this point, too little is known about mutations of this type, and more experimental work is required to understand their spectrum.
The mutation matrices in Figure 1 are sparse (that is, a large number of the matrix elements are close or equal to zero) and nonsymmetrical (in many cases the tendency of amino acid I to mutate into amino acid J is different from the tendency of amino acid J to mutate into I). The vast majority of human genetic mutations are caused by single-nucleotide changes [11, 12]. Consequently, the matrices in Figure 1b,c,d represent amino-acid transitions resulting predominantly from single-nucleotide mutations in amino-acid codons. To rationalize the observed disease and benign spectra, we generated the expected mutation spectrum (Figure 1a) using the neighbor-dependent matrix of nucleotide mutation rates developed by Hess et al.  (see Materials and methods). The expected mutation matrix in Figure 1a represents the spectrum which would be observed if all nonsynonymous mutations were accepted (that is, there were no selection). The expected spectrum was generated for the disease genes considered and, separately, for a large collection of more than 7,000 human genes available from SWISS-PROT. These two spectra were almost identical (R = 0.98, p < 0.0001), suggesting that the expected spectrum in Figure 1a reflects general properties of all human genes (such as amino-acid codon frequencies and context-dependent nucleotide mutation frequencies). Here and throughout the paper we use the t-test statistics with n-2 degrees of freedom to estimate the significance of linear correlations. Random shuffling simulations confirmed the significance values obtained using the t-test.
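The significance test described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names, the number of shuffling trials, and the two-sided comparison in the shuffling test are assumptions made for the example. The t statistic for a Pearson correlation R over n points is t = R·sqrt((n − 2)/(1 − R²)), with n − 2 degrees of freedom.

```python
import math
import random


def pearson_r(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for a correlation r over n points (n - 2 degrees of freedom).

    Undefined for |r| == 1, where the denominator vanishes.
    """
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def shuffle_p_value(x, y, trials=2000, seed=0):
    """Empirical p-value: how often a random shuffling of y gives |R| at
    least as large as the observed one (trial count and seed are arbitrary)."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (trials + 1)
```

Here x and y would be the flattened off-diagonal elements of two mutation matrices. The t statistic is then compared against a Student t distribution to obtain p, and the shuffling estimate serves as an independent check.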
The spectrum of disease mutations was calculated separately for membrane proteins. The program TMHMM  was used to detect potential transmembrane regions. The disease spectrum for membrane proteins is very similar to the all-protein disease spectrum (R = 0.97, p < 0.0001 for all disease mutations in membrane proteins, 1,598 in total; R = 0.75, p < 0.0001 for disease mutations in transmembrane regions, 372 in total). Evidently, specific properties of membrane proteins and the constraints on them are not able to significantly modify the disease spectrum common to all proteins.
Correlations between the expected and the observed spectra
Close-to-diagonal mutations in Figure 1a,b,c,d represent substitutions between amino acids with similar chemical properties (conservative mutations). The interspecies substitutions (Figure 1d) contain the highest fraction of conservative mutations compared to disease mutations and benign SNPs. The frequencies of the benign SNPs, disease mutations, and interspecies substitutions are plotted versus expected frequencies in Figure 2. Benner et al.  showed that the genetic code affects the amino-acid substitution spectrum at early stages of divergence, whereas chemical similarities dominate at longer evolutionary distances. The correlation between the benign and expected spectra observed in our study (R = 0.78, p < 0.0001) is an expected extension of the conclusion of Benner et al. to even shorter evolutionary distances (variations within a population).
Interestingly, we also find a strong correlation of the disease mutation spectrum with the expected spectrum based on the genetic code (R = 0.71, p < 0.0001). The correlation of disease mutation frequencies with the chemical dissimilarities between original and mutant amino acids is apparent only after normalization by the expected frequencies (Figure 3a,b). Consequently, in the majority of cases the comparison of amino-acid types (wild type versus mutant) will be insufficient to distinguish neutral from disease variants.
The contribution of mutations at different amino acids to the disease spectrum is highly heterogeneous (Figure 4). Interestingly, mutations at Arg residues account for almost 15% of the disease mutations. This is a direct consequence of the well-known high mutability of Arg (due to deamination of 5'-CpG dinucleotides in Arg codons) [16, 17], the relatively high frequency of Arg in human proteins (>5%), and the fact that Arg mutates to residues with very different chemical properties (cysteine (Cys), glycine (Gly), histidine (His), lysine (Lys), leucine (Leu), methionine (Met), proline (Pro), glutamine (Gln), serine (Ser) and tryptophan (Trp)). The relative probability of a disease mutation at different amino acids (Figure 4b) was calculated by dividing the disease frequencies by the expected frequencies. Accordingly, a random mutation at a Trp or Cys residue has the highest probability of causing a disease. This correlates well with the highest evolutionary conservation of exactly these two residues . Both Trp and Cys residues play a prominent part in determining protein stability. In addition to Trp and Cys, the high probability of disease mutations at Gly may be related to important structural roles often played by this residue. For example, mutations at Gly, which is frequently present at the turns of alpha-helices, might have a negative impact on protein structural stability. Our definition of the relative probability of disease mutations is similar to the relative clinical observation likelihood (RCOL) used by Cooper et al. in several publications (see, for example ). In the next section we extend the relative probabilities to interspecies comparisons.
Probabilities of mutations or SNPs as a function of the mutation/SNP-site properties
To complement the analysis of the amino-acid mutation matrices, we investigated how the probabilities of benign SNPs and disease mutations depend on the properties of the mutation site. Several recent studies have focused on developing evolutionary and structural approaches to predict potentially deleterious human mutations [1–4, 18, 19]. Here, we focus on understanding the relative mutation probabilities (see Materials and methods). Our results are in general agreement with the previous studies. The relative probabilities of disease mutations and benign SNPs are shown in Figure 5a as a function of the interspecies evolutionary conservation of the mutation site. The conservation was characterized by the relative entropy measure using homologs with more than 30% sequence identity. The probability that a random mutation will cause a genetic disease increases monotonically with an increase in the degree of site conservation, while the probability of observing nonsynonymous benign SNPs shows the opposite trend. The synonymous benign SNPs do not change amino acids and should be predominantly neutral. As a result, their probability is uniform across sites.
The solvent accessibility of an amino-acid residue in a protein reflects the degree of the residue's exposure to the surrounding solvent in the protein structure. The relative probability of disease-causing mutations is highest in the protein interior and lowest on the protein surface (Figure 5b). The benign SNPs show the reverse trend, as their relative probability is highest on the surface and lowest in the protein interior. This is consistent with the study by Moult and co-workers  (see also Ferrer-Costa et al.  and Bustamante et al. ), who suggested that the dominant mechanism by which disease mutations damage protein function is a decrease in protein stability, as opposed to mutations of active-site residues (usually located on the protein surface).
Both relative entropy and solvent accessibility exclusively characterize the site of a mutation. To estimate the extent to which a given amino acid is incompatible with the residues observed at the same position in close homologs, we introduced the Grantham Ratio (GR) score based on the Grantham dissimilarity matrix  (see Materials and methods for a formal definition). Application of other scores, for example those based on the BLOSUM matrices, gave qualitatively similar results . The GR score is the ratio of two averages: the numerator is the average dissimilarity between the mutated amino acid and the residues observed at the same site in evolution, and the denominator is the average dissimilarity among the residues observed at the site in homologous proteins. Defined in this way, a GR score smaller than or close to 1 suggests that the amino acid is similar to the residues observed at the site in evolution, whereas a GR score significantly larger than 1 indicates that the amino-acid change is evolutionarily radical.
The role of purifying selection in shaping the mutation spectra is apparent from the cumulative distribution of the GR scores (Figure 6). Whereas the GR distribution for original (wild type) residues at benign sites (blue curve) is very similar to the distribution for all protein residues (black), the distribution for mutant residues at benign sites (green) clearly shows an excess of radical mutations. Importantly, the GR distribution of mutant residues at benign sites (green) is similar to the distribution for randomly generated mutations (cyan) and is quite different from the disease mutation distribution (red). Consequently, although a significant fraction of randomly arising nonsynonymous mutations are evolutionarily radical (and thus potentially deleterious), they are not, on average, as radical as the disease mutations and still have appreciable frequencies in the human population. Indeed, it was recently estimated  that the average reduction in evolutionary fitness due to a mildly deleterious SNP with a significant frequency in the human population is in the range of 0.01-1%. The medical importance of such mildly deleterious human mutations remains to be established [24, 25].
The cumulative distribution of the GR scores for disease mutations suggests that more than half of the disease mutations are evolutionarily radical (represented by residues with GR score greater than 2). Residues with such GR scores are almost never observed in homologous sequences (blue and black curves). It is important to note that medically damaging mutations and SNPs cannot always be rationalized in terms of evolutionary radicality. Medically harmful mutations may cause late-onset human diseases without strong selection in evolution. Alternatively, a particular amino-acid substitution can be damaging to a human protein but be relatively frequent in the homologous family due to compensatory mutations. Such substitutions may account for deleterious mutations with low GR scores.
Estimation of the maximal rate of mutations with impact on human health
From Figure 6 we can estimate the maximum rate of random mutations with significant impact on human health (that is, an impact similar to mutations currently annotated in MIM). We note that the mutation rate we estimate (a fraction of newly created deleterious mutations) is different from the fraction of existing SNPs with deleterious effects on protein function (estimated previously [1, 2, 23]). The comparison between the distribution of random SNP mutations (cyan) and disease mutations (red) suggests that about 10% of the randomly generated mutations have GR scores greater than 6. Such a score corresponds to approximately 40% of the disease mutations. As a result, the total rate of the disease mutations cannot be larger than one quarter of the random mutation rate. Thus, one expects, at most, 25% of random nonsynonymous mutations to be as damaging as mutations currently annotated in MIM (similar estimates are obtained using GR cutoffs larger than 6).
This estimate has a simple biochemical rationale, as mutagenesis experiments on different proteins suggest that less than 30% of random mutations substantially damage biological function or stability of proteins [26–29].
Using the recent estimate of the human mutation rate of 175 mutations per diploid genome per generation  (corresponding to approximately two to three nonsynonymous mutations), we conclude that the rate of nonsynonymous mutations with serious impact on human health should be less than one per diploid genome per generation. This is probably a substantial overestimation of the rate because we assume that all human genes are as important for human health as the well-annotated disease genes currently in the MIM database. We emphasize that the rate of health-damaging nonsynonymous mutations is smaller than the total rate of deleterious human mutations, which is estimated to be larger than one [30, 31].
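The arithmetic behind this upper bound can be made explicit with a short sketch. The function name is hypothetical; the input values (10%, 40%, and two to three nonsynonymous mutations per generation) are the ones quoted in the text. The reasoning is that if a fraction f_d of disease mutations exceeds a GR cutoff, while only a fraction f_r of all random mutations does, then disease-like mutations can make up at most f_r / f_d of random mutations.

```python
def max_disease_fraction(frac_random_above, frac_disease_above):
    """Upper bound on the fraction of random mutations that are disease-like.

    If x is that fraction, the disease-like mutations above the cutoff make up
    frac_disease_above * x of all random mutations, which cannot exceed
    frac_random_above. Hence x <= frac_random_above / frac_disease_above.
    """
    if frac_disease_above <= 0:
        raise ValueError("frac_disease_above must be positive")
    return min(1.0, frac_random_above / frac_disease_above)


# Values quoted in the text: about 10% of random mutations and about 40% of
# disease mutations lie above the GR cutoff.
bound = max_disease_fraction(0.10, 0.40)

# Two to three new nonsynonymous mutations per diploid genome per generation.
rates = {n_nonsyn: bound * n_nonsyn for n_nonsyn in (2, 3)}
for n_nonsyn, rate in rates.items():
    print(f"{n_nonsyn} nonsynonymous mutations -> at most {rate:.2f} damaging")
```

Under these inputs the bound stays below one damaging nonsynonymous mutation per diploid genome per generation for both ends of the two-to-three range.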
The present analysis, together with other recent studies [1–4, 23], establishes the basis for understanding the spectrum of deleterious human mutations. The amino-acid substitution matrices, such as PAM  and BLOSUM , apart from playing a fundamental role in sequence alignment, qualitatively characterize the evolutionary interchangeability of amino acids averaged over many protein families. The disease spectrum, characterized by our analysis, explores another important aspect of evolution, namely the generation of deleterious mutations. Because all mammalian species have broadly similar physiology, the properties of the disease spectrum should be general, at least for mutations leading to early-onset diseases. We anticipate that understanding the disease spectrum will allow one to predict, in advance, the rates and potential medical consequences of all possible single-nucleotide mutations in the human genome.
Materials and methods
Calculation of mutation spectra
The spectrum of expected amino-acid mutation frequencies (Figure 1a) was generated using the matrix of neighbor-dependent nucleotide mutation rates obtained by Hess et al.  (Additional data file 1). The neighbor-dependent mutation matrix was calculated by Hess et al. on the basis of 20,200 substitutions in aligned gene/pseudogene human sequences; the relative mutation rates were calculated for the four nucleotides in all 16 possible 5' and 3' neighborhoods. To obtain the expected amino-acid mutation frequencies for a given collection of genes, we simulated all possible single-nucleotide mutations with appropriate rates, and recorded the corresponding amino-acid changes. The nucleotide mutational spectrum of individual genes may be affected by the presence of so-called mutation hot spots [33–35]. However, on average, there is only a small influence of the surrounding DNA sequence (beyond nearest 5' and 3' neighbors) on the relative nucleotide mutation rates .
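The enumeration described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the study: the function name and the `rate` callback are hypothetical, the default rate is uniform (the neighbor-dependent rates of Hess et al. would be supplied through `rate`), and mutations to or from stop codons are skipped so that only amino-acid changes are tallied. The codon table is the standard genetic code.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def expected_spectrum(coding_seq, rate=lambda five, old, new, three: 1.0):
    """Relative frequencies of amino-acid changes over all single-nucleotide
    mutations of an in-frame coding sequence (A, C, G, T only).

    rate(five, old, new, three) returns the relative rate of old -> new given
    the 5' and 3' neighbors (None at the sequence ends). The uniform default
    is a placeholder assumption, not the published rate matrix.
    """
    seq = coding_seq.upper()
    spectrum = Counter()
    for pos in range(len(seq) - len(seq) % 3):
        offset = pos % 3
        codon = seq[pos - offset:pos - offset + 3]
        old_aa = CODON_TABLE[codon]
        five = seq[pos - 1] if pos > 0 else None
        three = seq[pos + 1] if pos + 1 < len(seq) else None
        for new in BASES:
            if new == seq[pos]:
                continue
            mutant = codon[:offset] + new + codon[offset + 1:]
            new_aa = CODON_TABLE[mutant]
            # Keep nonsynonymous changes between amino acids only.
            if new_aa != old_aa and "*" not in (old_aa, new_aa):
                spectrum[(old_aa, new_aa)] += rate(five, seq[pos], new, three)
    total = sum(spectrum.values())
    if total == 0:
        return {}
    return {pair: weight / total for pair, weight in spectrum.items()}
```

For a collection of genes, the per-gene tallies would be summed before normalization, so that the codon usage of the whole set shapes the expected matrix.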
The interspecies spectrum of amino-acid mutation frequencies (Figure 1d) was calculated on the basis of Dayhoff's PAM1 matrix. The original PAM1 matrix  gives the probabilities of amino-acid substitutions over small evolutionary distances. These probabilities were multiplied by the amino-acid frequencies in human genes for direct comparison with the expected, disease, and benign SNPs matrices.
Structural and evolutionary analysis of mutations
The list of disease genes obtained from SWISS-PROT was filtered using the program PSEG  to exclude genes with a significant fraction of low-complexity regions. As a result of the filtering, six genes for collagen proteins were excluded from the original set of 436 genes. Mutations at Gly residues constitute more than 50% of the collagen disease mutations (due to the collagen structural motif). Because of this bias, the collagen mutations were excluded from all calculations. If the collagen mutations are included, the total fraction of disease mutations at Gly (Figure 4a) increases from 12% to 15%.
Membrane proteins and transmembrane protein regions were detected using the program TMHMM  with standard parameters. Out of 430 disease genes, 105 (24%) were classified as membrane proteins on the basis of the presence of at least two distinct transmembrane domains. To characterize the evolutionary conservation of mutation sites we used BLASTGP to search the nrdb90 database  for homologs with greater than 30% sequence identity. The nrdb90 database constitutes a nonredundant merge of sequence and structural databases, which is filtered so that no pair of sequences has greater than 90% sequence identity. The homologs to each human protein were subsequently aligned using the program CLUSTALW  with default parameters. Only mutation sites covered by more than 10 homologous sequences (excluding gaps) were used in the evolutionary analysis. The multiple sequence alignments obtained using CLUSTALW were used to characterize the relative entropy (Kullback-Leibler distance) of the benign and disease mutation sites. The relative entropy was calculated according to the formula:
$$\text{Relative entropy} = \sum_{n} P(n) \log \frac{P(n)}{Q(n)}$$
where the summation is over all amino-acid types n in the alignment; P(n) is the probability of the amino acid n in the column corresponding to the mutation; Q(n) is the probability of the amino acid n in all columns of the multiple sequence alignment.
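A minimal sketch of this calculation for one alignment column is given below. The function name, the use of natural logarithms, and the treatment of gaps (simply ignored) are assumptions made for the example; the column of interest is assumed to be one of the alignment columns, so that every residue seen in it has nonzero background probability.

```python
import math
from collections import Counter


def relative_entropy(column, all_columns, gap="-"):
    """Kullback-Leibler distance between the residue distribution P in one
    alignment column and the background distribution Q over all columns.

    column is a string of residues (one per aligned sequence); all_columns is
    an iterable of such strings and is assumed to include column itself.
    """
    p_counts = Counter(res for res in column if res != gap)
    q_counts = Counter(res for col in all_columns for res in col if res != gap)
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    entropy = 0.0
    for residue, count in p_counts.items():
        p = count / p_total
        q = q_counts[residue] / q_total
        entropy += p * math.log(p / q)
    return entropy
```

A fully conserved column whose residue is rare elsewhere in the alignment gives a large value, while a column whose composition matches the background gives a value near zero.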
The multiple sequence alignments were also used to calculate the Grantham ratio (GR) score according to the formula:
$$GR = \frac{\frac{1}{n}\sum_{i=1}^{n} D(\mathrm{Human\_RES}, \mathrm{RES}(i))}{\frac{2}{n(n-1)}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} D(\mathrm{RES}(i), \mathrm{RES}(j))}$$
where D(A,B) is the Grantham measure of chemical dissimilarity between amino-acid residues A and B, Human_RES is the human residue at the mutation site, RES(i) is the amino acid from the ith aligned sequence homolog at the mutation site, and n is the number of aligned sequences. Qualitatively, the GR score is a measure of dissimilarity between a human amino acid and the residues seen at the same site in homologs. In total, the relative entropy and Grantham ratio were calculated for 258 benign SNPs and 2,636 disease mutations.
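The ratio of averages can be sketched as below. Several details are assumptions made for the example and are not specified in the text: the denominator averages D over all distinct pairs of homolog residues, a site with no variation among homologs raises an error, and the dissimilarity table is a toy stand-in with invented values, not the Grantham matrix. The function and variable names are hypothetical.

```python
from itertools import combinations


def grantham_ratio(residue, homolog_residues, dissimilarity):
    """Average dissimilarity between residue and the homolog residues,
    divided by the average pairwise dissimilarity among the homolog residues.

    dissimilarity(a, b) is any symmetric measure with dissimilarity(a, a) == 0.
    """
    n = len(homolog_residues)
    if n < 2:
        raise ValueError("need at least two homolog residues")
    numerator = sum(dissimilarity(residue, r) for r in homolog_residues) / n
    pairs = list(combinations(homolog_residues, 2))
    denominator = sum(dissimilarity(a, b) for a, b in pairs) / len(pairs)
    if denominator == 0:
        # Assumption: fully conserved sites are left undefined here.
        raise ValueError("no variation among homolog residues")
    return numerator / denominator


# Toy dissimilarities for illustration only (NOT Grantham's published values).
TOY_D = {
    frozenset("LI"): 10, frozenset("LV"): 30, frozenset("IV"): 30,
    frozenset("WL"): 60, frozenset("WI"): 60, frozenset("WV"): 90,
}


def toy_dissimilarity(a, b):
    return 0 if a == b else TOY_D[frozenset((a, b))]


site = ["L", "I", "V"]
gr_wild_type = grantham_ratio("L", site, toy_dissimilarity)  # below 1
gr_mutant = grantham_ratio("W", site, toy_dissimilarity)     # well above 1
```

In this toy case the wild-type residue scores below 1 because it resembles the residues seen at the site, whereas the dissimilar mutant scores well above 1, which is the qualitative behavior described for the GR score.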
To characterize the structural location of disease mutations and benign SNPs, BLASTGP  was used to search the Protein Data Bank (PDB)  for sequences homologous to known structures. Only sequences with greater than 30% identity to human sequences over the entire length of the alignment were considered. In total, the solvent accessibilities were calculated for 110 benign SNPs and 840 disease mutations. The solvent accessibility of mutation sites was determined by the program NACCESS  using the water-sphere radius of 1.4 Å. The solvent accessibility represents the relative exposure of a residue X in a protein structure compared to its exposure in the tripeptide Ala-X-Ala.
Calculation of relative mutation probabilities
The relative mutation probabilities shown in Figures 4b, 5a, and 5b represent conditional probabilities. Specifically, the conditional probability P(disease|descriptor), that a mutation will cause a genetic disease given a certain property (descriptor) of the mutation site was calculated according to the formula:
$$P(\text{disease} \mid \text{descriptor}) = \frac{P(\text{descriptor} \mid \text{disease})\, P(\text{disease})}{P(\text{descriptor})}$$
where 'descriptor' represents solvent accessibility or evolutionary conservation of the mutation site, P(descriptor|disease) is the probability that a disease mutation has a given descriptor value, P(descriptor) is the probability that a random mutation (disease or non-disease) has a given descriptor value, and P(disease) is the probability that a random mutation will cause a genetic disease. Importantly, because P(disease) is unknown, we can only estimate P(disease|descriptor) up to a constant (that is, for an assumed value of P(disease)). Consequently, we refer to P(disease|descriptor) as relative mutation probabilities. The probability that a random mutation has a given descriptor value, P(descriptor), was estimated by simulating random single-nucleotide mutations using the expected amino-acid mutation frequencies (Figure 1a).
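As a sketch of how such relative probabilities could be computed from binned data, the example below applies Bayes' rule per descriptor bin. The function name, the bin labels, and the counts are invented for illustration; P(disease) is left as a free constant (1.0 by default), so the outputs are meaningful only relative to one another.

```python
def relative_disease_probability(disease_counts, random_counts, p_disease=1.0):
    """P(disease | bin) for each descriptor bin, up to the constant p_disease.

    disease_counts maps bin -> number of disease mutations in that bin;
    random_counts maps bin -> number of simulated random mutations in that bin.
    Bins with no random mutations are skipped because P(descriptor) is zero.
    """
    disease_total = sum(disease_counts.values())
    random_total = sum(random_counts.values())
    result = {}
    for bin_label, n_random in random_counts.items():
        if n_random == 0:
            continue
        p_descriptor = n_random / random_total
        p_descriptor_given_disease = disease_counts.get(bin_label, 0) / disease_total
        result[bin_label] = p_descriptor_given_disease * p_disease / p_descriptor
    return result


# Hypothetical counts, for illustration only.
toy_disease = {"buried": 60, "exposed": 40}
toy_random = {"buried": 30, "exposed": 70}
toy_relative = relative_disease_probability(toy_disease, toy_random)
```

With these toy counts the buried bin receives a relative probability above 1 and the exposed bin a value below 1, mirroring the direction of the trend reported for solvent accessibility without reproducing any measured values.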
Additional data files
The following additional data are included: a list of relative mutation rates (Additional data file 1), a list of disease mutations (Additional data file 2), a list of disease mutation genes (Additional data file 3), a list of SNPs used in the analysis (Additional data file 4), and the Grantham ratio scores (Additional data file 5).
Wang Z, Moult J: SNPs, protein structure, and disease. Hum Mutat. 2001, 17: 263-270. 10.1002/humu.22.
Terp BN, Cooper DN, Christensen IT, Jorgensen FS, Bross P, Gregersen N, Krawczak M: Assessing the relative importance of the biophysical properties of amino acid substitutions associated with human genetic disease. Hum Mutat. 2002, 20: 98-109. 10.1002/humu.10095.
Stephens JC, Schneider JA, Tanguay DA, Choi J, Acharya T, Stanley SE, Jiang R, Messer CJ, Chew A, Han JH, et al: Haplotype variation and linkage disequilibrium in 313 human genes. Science. 2001, 293: 489-493. 10.1126/science.1059431.
Halushka MK, Fan JB, Bentley K, Hsie L, Shen N, Weder A, Cooper R, Lipshutz R, Chakravarti A: Patterns of single-nucleotide polymorphisms in candidate genes for blood-pressure homeostasis. Nat Genet. 1999, 22: 239-247. 10.1038/10297.
Cargill M, Altshuler D, Ireland J, Sklar P, Ardlie K, Patil N, Shaw N, Lane CR, Lim EP, Kalyanaraman N, et al: Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat Genet. 1999, 22: 231-238. 10.1038/10290.
Ferrer-Costa C, Orozco M, de la Cruz X: Characterization of disease-associated single amino acid polymorphisms in terms of sequence and structure properties. J Mol Biol. 2002, 315: 771-786. 10.1006/jmbi.2001.5255.
Lohmueller KE, Pearce CL, Pike M, Lander ES, Hirschhorn JN: Meta-analysis of genetic association studies supports a contribution of common variants to susceptibility to common disease. Nat Genet. 2003, 33: 177-182. 10.1038/ng1071.
Templeton AR, Clark AG, Weiss KM, Nickerson DA, Boerwinkle E, Sing CF: Recombinational and mutational hotspots within the human lipoprotein lipase gene. Am J Hum Genet. 2000, 66: 69-83. 10.1086/302699.
Bernstein FC, Koetzle TF, Williams GJB, Meyer EF, Brice MD, Rodgers JR, Kennard O, Shimanouchi T, Tasumi M: The Protein Data Bank: A computer based archival file for macromolecular structures. J Mol Biol. 1977, 112: 535-542.
We thank Jay Shendure, John Aach, Patrik D'haeseleer, Daniel Segre, Peter Kharchenko, and Tzachi Pilpel for discussions. This work was supported in part by research grants from the US Department of Energy through the grant DOE DE-FG02-87-ER60565.
Present address: Computational Biology Center, Memorial Sloan-Kettering Cancer Center, 1275 York Avenue, New York, NY, 10021, USA
Authors and Affiliations
Lipper Center for Computational Genetics and Department of Genetics, Harvard Medical School, Boston, MA, 02115, USA
Dennis Vitkup & George M Church
Whitehead Institute for Biomedical Research, Nine Cambridge Center, Cambridge, MA, 02142, USA