Evolutionary conservation and selection of human disease gene orthologs in the rat and mouse genomes
© Huang et al.; licensee BioMed Central Ltd. 2004
Received: 16 March 2004
Accepted: 28 May 2004
Published: 28 June 2004
Model organisms have contributed substantially to our understanding of the etiology of human disease and have assisted with the development of new treatment modalities. The availability of the human, mouse and, most recently, the rat genome sequences now permits the comprehensive investigation of the rodent orthologs of genes associated with human disease. Here, we investigate whether human disease genes differ significantly from their rodent orthologs with respect to their overall levels of conservation and their rates of evolutionary change.
Human disease genes are unevenly distributed among human chromosomes and are highly represented (99.5%) among human-rodent ortholog sets. Differences are revealed in evolutionary conservation and selection between different categories of human disease genes. Although selection appears not to have greatly discriminated between disease and non-disease genes, synonymous substitution rates are significantly higher for disease genes. In neurological and malformation syndrome disease systems, associated genes have evolved slowly whereas genes of the immune, hematological and pulmonary disease systems have changed more rapidly. Amino-acid substitutions associated with human inherited disease occur at sites that are more highly conserved than the average; nevertheless, 15 substituting amino acids associated with human disease were identified as wild-type amino acids in the rat. Rodent orthologs of human trinucleotide repeat-expansion disease genes were found to contain substantially fewer of such repeats. Six human genes that share the same characteristics as triplet repeat-expansion disease-associated genes were identified; although four of these genes are expressed in the brain, none is currently known to be associated with disease.
Most human disease genes have been retained in rodent genomes. Synonymous nucleotide substitutions occur at a higher rate in disease genes, a finding that may reflect increased mutation rates in the chromosomal regions in which disease genes are found. Rodent orthologs associated with neurological function exhibit the greatest evolutionary conservation; this suggests that rodent models of human neurological disease are likely to most faithfully represent human disease processes. However, with regard to neurological triplet repeat expansion-associated human disease genes, the contraction, relative to human, of rodent trinucleotide repeats suggests that rodent loci may not achieve a 'critical repeat threshold' necessary to undergo spontaneous pathological repeat expansions. The identification of six genes in this study that have multiple characteristics associated with repeat expansion-disease genes raises the possibility that not all human loci capable of facilitating neurological disease by repeat expansion have as yet been identified.
Human gene mutations resulting in specific disease phenotypes were first reported in the scientific literature over 50 years ago [1, 2]. Since then, protein and nucleotide sequence changes associated with human disease have accumulated at a rapid rate. A large body of literature has appeared on human disease-associated mutations, normal sequence variation, and alterations that acquire pathological significance when combined with other deleterious alleles or second-site mutations. With this information compiled into organized databases [3, 4], it is now possible to conduct large-scale, comprehensive analyses of human disease genes. Such studies acquire additional discriminatory power with the availability of multiple genome sequences from model organisms, as comparative studies can provide novel evolutionary insights into the selective relevance of genetic changes. In the present study, we have used a collection of nearly 1,200 human disease gene sequences to perform a large-scale analysis of gene and sequence conservation.
Investigation of evolutionary rates among large sets of genes has become feasible with the availability of the genome sequences of human, mouse and rat [5–8]. The degree of selective pressure to which genes have been subjected is reflected by the ratio of KA, the number of non-synonymous substitutions per non-synonymous site, to KS, the number of synonymous substitutions per synonymous site. Hurst and Smith have compared these ratios for 'essential' mammalian genes (that is, those whose genetic knock-out results in lethality or infertility) to those for genes that produce a viable and fertile phenotype when subject to genetic knock-out (non-essential genes). Using a sample size of 67 essential genes and 108 non-essential genes, these authors showed that essential genes manifested significantly lower KA/KS ratios than non-essential genes. Upon further analysis, however, they found that immune-system genes, which have high KA/KS values, accounted for much of this effect since these loci were over-represented in the non-essential gene set. Analyses of evolutionary rates must therefore account for rate variation across different tissues.
Using a larger dataset (2,400 human-rodent orthologs and 834 rat-mouse orthologs) and EST information, Duret and Mouchiroud observed that tissue-specific genes, on average, exhibited higher KA/KS ratios than genes expressed in most tissues (so-called 'housekeeping genes'). A more recent study of microarray data confirmed this finding, and demonstrated that much of this effect is explicable in terms of a correlation between a gene's tissue-specificity and the cellular localization of its encoded protein.
In this study, we have used genes predicted from the completed mouse, rat and human genomes, and a manually validated set of human disease genes. Our aims were three-fold. Firstly, we sought to determine whether human disease genes are collectively distinguishable, with respect to evolutionary conservation and evolutionary rates, from non-disease genes. Then we investigated whether genes ascribed to different pathophysiological systems exhibit significant differences in evolutionary rates. The results promise to be relevant for the consideration of different types of animal models utilized to investigate the mechanisms of human disease. Finally, we considered the category of human disease genes harboring expansions of trinucleotide repeats. For these genes, a moderate number of repeats is usually compatible with a normal phenotype, whereas further expansions are frequently associated with a neurological disease phenotype. We studied polyglutamine repeats in rat, mouse and human orthologous sequences to obtain an evolutionary perspective on the mechanisms of glutamine-repeat generation.
Results and discussion
Conservation of human disease genes in the rat genome
Table. Chromosome distribution of HGMD disease gene set (columns: Ensembl gene number; disease gene percentage).
Of these 1,112 Ensembl genes, 844 (76%) were found to have orthologous genes in the rat genome (November 2002 assembly), according to Ensembl. This is a significantly greater proportion of Ensembl 1:1 orthologs than is found for the set of all Ensembl genes (46%). One reason for this difference might be a higher fidelity of predicting human disease genes and their rodent orthologs from their genome sequences, compared with other genes. This would be a consequence of the greater availability of transcript evidence, principally cDNA sequences, for disease genes. Imperfections in gene prediction and genome sequence assembly are the major hindrances to accurate orthology prediction.
We next wished to determine whether any of the 268 remaining human disease genes lacked a rat ortholog, perhaps as a consequence of either pseudogene creation or gene deletion in the rat lineage. Bearing in mind that certain regions of the rat genome sequence remain incomplete, that gene and orthology predictions are inexact, and that additional sources of sequence information are available, each of the remaining 268 human genes was aligned against the rat genome (assembly versions 2.0 and 3.1), EST, cDNA and protein sequences using BLAT and BLAST. With these methods, we were able to assign orthology even if gene-duplication events had occurred within the human or rat lineages. Detailed inspection of alignments indicated that only six human disease genes appear to have no orthologous counterparts among available rat sequences.
Of the six missing orthologs, three with known function were found to be present in mouse sequences: orthologs of human genes HLXB9 (homeobox gene HB9), SGSH (N-sulfoglucosamine sulfohydrolase) and GP6 (glycoprotein VI, platelet) with LocusLink identifiers 3110, 6448 and 51206, respectively. Hence, these genes might yet be found in the portion of the rat genome that still remains to be sequenced. Two of the three genes missing from both mouse and rat appear to have become pseudogenes relatively recently, given that there are known hamster orthologs [16, 17]. These are cholesteryl ester transfer protein (CETP), which is associated with the deficiency in CETP activity described in rat and mouse, and Fuc-TIII (FUT3), an α-(1-3)-fucosyltransferase involved in the synthesis of milk oligosaccharides. A third gene, KAL1 (encoding the Kallmann syndrome, or anosmin-1, protein) is entirely absent from the sequenced portions of the rat and mouse genomes. However, rodent Kal-1 genes may yet be found in the pseudoautosomal regions of their genomes. Kal-1 is present in Caenorhabditis elegans, amphibians, fish and birds, and its rat and mouse orthologs have been reported to be detectable using an antibody to the human KAL1 gene product.
Thus, of the 1,112 human disease genes examined, all except the six genes discussed above showed evidence of being represented as functional genes in the rat genome. Clearly, the set of genes identified as being associated with inherited disease in humans is highly conserved in the rat genome.
Mapping human disease mutations to rat genes
We compared sequence variants that result in human inherited disease with amino-acid substitutions that have accumulated since the common ancestor of rat and human. In all, 12,549 missense mutations were mapped to codons in the pairwise alignments of Ensembl human and rat 1:1 orthologs. As expected, the majority (89.6%) of these sites contain the same amino acid in both human and rat wild-type sequences. This exceeds the 82.2% of all sites that are identical for all 1:1 ortholog pairs, indicating that such sites are subject to a greater degree of purifying selection. Of the remaining 10.4% of sites, 4.6% could not be aligned with precision, while 4.9% exhibited amino-acid substitutions in the rat ortholog that differed from the human disease missense mutations.
Table. Instances where a substituting amino acid in a human disease mutation is identical to the wild-type sequence of the rat genome (column: amino acid change). Diseases listed: autoimmune lymphoproliferative syndrome; factor XII deficiency; familial cold autoinflammatory syndrome; glaucoma 1, open angle; Leber congenital amaurosis IV; leukoencephalopathy with vanishing white matter; macular degeneration, age related; steroid-5 alpha-reductase deficiency.
Nucleotide substitution rates
We considered whether the highly significant difference between the two KS distributions arose from a small number of outlier disease genes associated with high KS values. However, this appears not to be the case, since removal of the top 10% of data points in both datasets did not reduce the divergence of the two distributions (data not shown).
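The outlier check described above can be illustrated with a minimal sketch. It assumes that KS values are held in plain Python lists and uses the difference in medians as a simple stand-in for divergence between the distributions; the statistical test actually applied is not reproduced here, and the function names `trim_top` and `median_shift` are hypothetical.

```python
from statistics import median


def trim_top(values, fraction=0.10):
    """Drop the largest `fraction` of values and return the rest, sorted."""
    ordered = sorted(values)
    n_drop = int(len(ordered) * fraction)
    return ordered[: len(ordered) - n_drop]


def median_shift(a, b, fraction=0.10):
    """Median of `a` minus median of `b` after trimming the top of each.

    This is only an illustrative summary of divergence (an assumption of
    this sketch), not the significance test used in the study.
    """
    return median(trim_top(a, fraction)) - median(trim_top(b, fraction))
```

If the shift between the trimmed sets remains similar to the shift between the full sets, a handful of extreme values is unlikely to explain the difference.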
Recently, Smith and Eyre-Walker calculated evolutionary rates for rat-human orthologs of 387 human disease genes and 2,024 non-disease genes, taken from the Duret and Mouchiroud and Jimenez-Sanchez et al. datasets, respectively. They noted that KA/KS and KS values were significantly elevated for disease genes compared with non-disease genes. Although the findings relating to KS are fully supported by our study, our results indicate only a modest difference between the KA/KS distributions of disease and non-disease genes (human-mouse: P = 0.044; human-rat: P = 0.032), rather than the 24% difference (P < 0.0001) reported by Smith and Eyre-Walker. We attribute this difference to the variation that can arise from sampling error when smaller gene sets are employed.
One interpretation of the findings reported here is that substitutions at non-synonymous sites have indirectly affected silent substitution rates [11, 24]. However, such an effect is unlikely to be highly pronounced since the significance of the KS distributions' difference was several orders of magnitude higher than that of the KA/KS distributions. The finding rather suggests that human disease gene sequences, and their rat orthologs, have mutated faster than their non-disease counterparts. If so, then it would appear that disease genes differ from other genes in one respect: they are more frequently encoded in hypermutable genomic regions. One possibility that could account for an elevated mutation rate is that the disease gene set contains disproportionately few genes expressed in germ cells. This is because mutations in genes expressed in germ cells might be expected to be more frequently repaired by transcription-coupled repair [25, 26]. Another possibility is that disease genes are more prevalent in genomic regions that suffer elevated mutation rates for other, as yet unknown, reasons. Certainly, neutral rates have been found to vary significantly between distinct genomic regions.
Human disease genes and pathophysiology-based disease systems
A sufficient number of human disease genes have now been characterized in adequate detail to permit grouping them by disease system categories for large-scale analysis. We therefore categorized 1,178 human disease genes according to which organ or pathophysiological system the disease best fitted with respect to a specific pathological variant (Additional data file 2). For example, adenosine-deaminase deficiency is caused by ADA gene mutations that reduce or eliminate enzyme function; these alterations result in frequently fatal severe combined immunodeficiency owing to the toxicity of the accumulating substrates, adenosine and 2'-deoxyadenosine. Considering both the nature of the mutation and the gene it alters, as well as the impact of the resulting disease, this gene is categorized under both metabolic- and immune-disease systems. Among the disease gene set, 889 genes were categorized into a single disease system whereas 289 genes were categorized into two systems.
Selection mechanisms acting on human disease genes in the rat and mouse
We find that significant differences exist between the KA/KS ratio distributions for the different pathophysiological classes. For example, within the neurological-disease system, 95% of the genes were subject to purifying selection (KA/KS < 0.25). This is in contrast to immune-system disease genes where only 65% were found to exhibit such low rates. Thus among all pathophysiological categories, it would appear that the genes of the neurological-disease system have been constrained by purifying selection the most, whilst those of the immune system have been constrained the least. No genes in our study met the strict criterion for positive selection (gene-averaged KA/KS ratios > 1.0), although adaptive evolution is more likely to have occurred at single sites for genes with ratios closer to 1.
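As an illustration of how the proportion of genes below the KA/KS < 0.25 cut-off might be tallied for one disease system, the following is a minimal sketch. The (KA, KS) pairs are invented for demonstration and are not data from this study; the function names are hypothetical.

```python
def ka_ks_ratio(ka, ks):
    """Return KA/KS for one ortholog pair; undefined when KS is not positive."""
    if ks <= 0:
        raise ValueError("KS must be positive")
    return ka / ks


def fraction_below(ratios, threshold=0.25):
    """Return the fraction of ratios strictly below `threshold`."""
    if not ratios:
        raise ValueError("ratios must not be empty")
    return sum(1 for r in ratios if r < threshold) / len(ratios)


# Hypothetical (KA, KS) pairs, invented for illustration only.
example_pairs = [(0.02, 0.50), (0.10, 0.50), (0.30, 0.60), (0.05, 0.55)]
example_ratios = [ka_ks_ratio(ka, ks) for ka, ks in example_pairs]
example_fraction = fraction_below(example_ratios)
```

Applying `fraction_below` separately to each pathophysiological category would give per-system proportions comparable in form to the 95% and 65% figures quoted above.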
In contrast to the findings for KA, KS and KA/KS comparing complete disease and non-disease gene sets, no significant differences in KS were found among different disease systems despite known differences among tissues . The significant differences identified in KA/KS ratios are likely to reflect differences in non-synonymous substitution rates (KA) across different disease systems.
We next investigated whether these variations in KA/KS ratios either arose from associations between physiology and organs or tissues, or were due to intrinsic properties of the human disease gene set under study. We studied two sets of genes: 586 sequences that were retrieved from the Human Proteome Survey Database (HPSD)  using gene ontology (GO) terms associated with immune function; and 761 genes retrieved using GO terms with neurological associations. Rat, mouse and human 1:1:1 orthology relationships were available for 200 of these 586 'immune' category genes (Additional data file 3) and for 304 genes of the 761 'neurological' genes (Additional data file 4); orthologs of human disease genes were disallowed from these sets.
We found no significant difference between the human-rat or human-mouse KA/KS distributions of the human 'immune' and human 'immune disease' gene categories (P = 0.0897). We then combined the disease and non-disease immune gene sets and determined the KA/KS values of this larger immune-system gene set compared to a set that included 7,641 non-immune control genes that have 1:1:1 orthologous relationships in mouse, rat and human. This control experiment confirmed our initial findings that the genes involved in the human immune response contain, as a set, fewer members subject to purifying selection than controls not involved in immune function. Utilization of the larger gene set in this analysis reduces the likelihood that results derive from sampling error. We therefore conclude that elevation of KA/KS values for disease genes of the immune system is a general property of immunologically relevant genes, rather than being specific to immune-system disease genes. This conclusion is consistent with findings from studies demonstrating that lymphocyte- or thymus-specific genes evolve relatively rapidly [11, 12].
Conservation index differences by disease system for model organisms
Table. Human genes with poly-glutamine repeat tracts expanded in the human lineage and not at present known to be associated with disease. Genes listed: CAGH3 transcription factor; DNA polymerase gamma subunit 1; nuclear receptor coactivator 3; retina-derived POU-domain factor-1; retinoic acid induced 1 isoform 2.
From these studies, we conclude that KA/KS ratios differ significantly for genes involved in either neurological or immune processes as compared with the set of all genes examined (see Materials and methods). Although less significant, differences in KA/KS ratio distributions were also identified for three other gene sets: the malformation syndrome, pulmonary and hematological categories. Genes in the neurological and malformation-syndrome systems display, on average, lower KA/KS ratios, whereas genes of the other three categories have, on average, higher KA/KS ratios. Thus, we conclude that the pathophysiological system differences we observe derive from organ, tissue and physiological characteristics rather than arising from properties unique to disease genes and their potential impact on fitness. Although these findings are not specific to genes associated with human disease, they could influence the selection of animal models used to investigate human disease.
Functional-annotation distribution by human disease system
We considered subsequently whether the KA/KS differences among pathophysiology system datasets described earlier arose from over- or under-representation of specific domains, functions or evolutionary families. For this we considered GO terms  and domain terms. Results indicated that domain names, or their annotations, did not account for differences among median KA/KS values among pathophysiological systems. Thus, although gene-family, domain, and other functional categories are non-uniformly distributed across the pathophysiology groups, their distribution is not the underlying cause of the KA/KS differences we observe.
Conservation of human disease genes in other model organisms
In addition to rodents, other animal models have also been extensively used in the study of human diseases (see review ). Given the utility and lower research costs for non-mammalian model organisms, we wished to determine the level of conservation of disease genes in these established models. We thus extended our analysis of rodent orthologs of human disease genes to a broader range of organisms including representative genomes from fish, nematode, fly and yeast.
Conservation metrics were selected for this analysis because comparisons among more distantly-related organisms typically identify multiple substitutions per site, disallowing calculation of KA/KS values from sequence pairs. We defined a conservation index (CI, also known as a score density) as the length-normalized amino-acid similarity between a sequence pair (see Materials and methods). We predicted the number of orthologs and quantiles of CI in each model organism species (Additional data file 5). We then compared CI in different disease systems for each of these organisms. Non-parametric methods were used to calculate the standardized score for each system in each organism similar to those applied in our previous analyses.
Thus, for the study of human diseases of immune and hematological systems, primate or human cell models would probably be most suitable. Rodent models are likely to be best suited for studies of genes in neurological, malformation-syndrome and metabolic categories. Neurological and metabolic genes are sufficiently well conserved that fly or fish models are appropriate, whereas the yeast and worm, in general, are perhaps best suited as models of metabolic diseases given the overall lower conservation found for other categories in our study.
These findings parallel previous studies [32, 33] that concluded that Drosophila is a good model organism for the study of genes in neurological and metabolic diseases, malformation syndromes and cancer. However, these studies based on 287 human disease genes categorized into 10 pathophysiological systems used the percentage of orthologs present per category to determine significance. By contrast, our study, with a substantially larger disease-gene set and more quantitative analysis, concluded that Drosophila is likely to have more limited utility as a model for the study of human cancer processes.
Amino-acid-repeat expansions associated with human disease
Glutamine expansion is associated with a number of different neurodegenerative disorders. In these diseases, long poly-glutamine tracts result from the expansion of CAG triplets by trinucleotide slippage. To obtain a general picture of poly-glutamine distribution and conservation in mammals, we compared poly-glutamine tracts in human-rat, human-mouse and rat-mouse ortholog pairs. We used aligned sequences to map equivalent (that is, orthologous) repeats and considered tandem repeats either of length 5 or longer, or of length 10 or longer ('very long repeats'). The two rodent species contained slightly fewer glutamine repeats than humans (85-88% of the number found in humans), in accord with the generally lower frequency of tandem amino-acid repeats in rodents as compared to humans. For very long glutamine repeats (more than 10 residues), we identified 40 repeats in human and 39 in mouse in the human-mouse comparison; 41 in human and 58 in rat in the human-rat comparison; and 83 in rat and 41 in mouse in the rat-mouse comparison. Thus, among very long repeats, an excess in human sequences was not detected and rat sequences contained more repeats than mouse sequences. The number of human glutamine repeats (repeat length 5 or longer) conserved in rat and in mouse was roughly 55% in both cases, slightly higher than the general human repeat conservation level in rodents (46.5% for rat and 52% for mouse).
The comparison of poly-glutamine length in the three mammalian species studied has shown that human disease genes that are associated with glutamine expansion are part of a larger group of genes likely to have experienced repeat expansions in the primate lineage. Examination of CAG and CAA codon repeats in this dataset confirms that lineage-specific glutamine repeats are associated with long CAG tracts whereas those conserved among different lineages tend to be encoded by a mixture of CAG/CAA codons . Comparisons of numbers of very long poly-glutamine repeats also indicate that the rate of glutamine expansion in the rodent lineage may be comparable to that in the human lineage, and that rat sequences may be particularly prone to accumulate long repeats, a desirable feature for transgenic models of triplet-repeat expansion-associated disease. Such models already exist for several such diseases [36–38]. This is in addition to the advantage of using the rat, as opposed to the mouse, as an animal model for the investigation of neurological disorders for in vivo imaging studies because of its larger brain size .
Almost all human disease genes have orthologous counterparts in rodent genomes. The set of these disease genes does not differ greatly from the set of other genes with respect to KA/KS ratios, although significant differences in synonymous substitution rates (KS) were observed. This suggests that human disease gene sequences and their rat orthologs may have mutated faster (or may have been repaired less efficiently) than their non-disease counterparts. Although the two KS distributions are significantly different, there is considerable overlap between them; the median difference between disease and non-disease distributions (0.05) is considerably smaller than one standard deviation (0.20). This means that the KS value of a particular gene, by itself, is not likely to be a sufficient indicator of whether it is, or is not, associated with disease.
Rodent orthologs of the gene set associated with neurological function exhibit the greatest conservation and are primarily subject to purifying selection. The highest KA/KS ratios were observed for genes that function in the immune system indicating that these genes are under less purifying selective pressure. This finding would be expected if host-pathogen co-evolution drives divergence by pathogen specificity within species. If sequence divergence were to be coupled to functional divergence, then this could suggest that rodent models of human neurological disease are more likely to faithfully represent human disease processes than rodent models of immune disease. Rodent models of human diseases in the immune-, hematological- and pulmonary-system pathophysiological categories should thus be validated particularly carefully before extrapolating from rodent studies to human.
Investigation of repeat-expansion disease genes led to the observation that all rodent homologs of these human disease genes bear shorter poly-glutamine repeat lengths. Furthermore, glutamine repeats in the human disease genes are mostly encoded by long CAG tracts. Rat-mouse-human comparative analysis also identified a number of human genes that, although not known to be associated with disease, share the same repeat characteristics as human disease-associated genes. These genes should be further investigated as potential disease candidates; of special interest are the four for which EST evidence indicates gene expression in the brain. Spontaneous neurological diseases arising through repeat-expansion mutations have not been identified in either rat or mouse laboratory strains or in natural populations. This could be due to ascertainment bias of rare events in rodent colonies or it is also possible that these orthologs fail to achieve a 'critical repeat threshold' required to trigger these mutational mechanisms. With the current successful development of rodent transgenic models using human disease gene constructs, this possibility can now be directly investigated. It will also be instructive to define the normal variation of rodent repeat lengths in natural populations for these genes to determine whether the variation in repeat numbers associated with a normal phenotype parallels that observed for human.
Materials and methods
Validation of disease role and assignment of disease-system annotation
The development of well-curated gene sets is an essential step for genome-scale disease gene analysis. The starting point for the present study was the Human Gene Mutation Database (HGMD) (February 2003 release). Beginning with 1,178 disease genes in this database, each gene was checked for at least one primary literature reference to confirm that it represented a bona fide gene in which a mutation had manifested an experimentally confirmed disease-association. During the annotation process, genes that did not meet this criterion were placed in an IDE category (insufficient disease evidence) but were not eliminated from the dataset. Thus, all genes that were placed into pathological categories were independently validated for disease association from the literature. Once a gene had passed this validation step, it was placed into one or more categories using the categorization method of Rubin et al. with minor modifications. Thus, each gene was assigned to one of the following categories: cancer, cardiovascular, endocrine, hematological, immune, malformation-syndrome, metabolic, neurological, pulmonary, renal or other. However, owing to the larger number of pathological systems represented for the genes categorized in this study, the following categories were added to those used in the Drosophila study: skin, bone-connective tissue, muscle, hepatic-GI-pancreatic, and nonneural eye. Annotation categories were combined where they overlapped functionally (for example, bone-connective tissue and hepatic-GI-pancreatic). Briefly, the annotation method was to read the disease gene entry in the Online Mendelian Inheritance in Man (OMIM) database to determine whether the pathophysiology category was identified in the synopsis in sufficient depth to make an assignment. This assignment was then confirmed using standard medical texts covering internal medicine, pathology and infectious disease.
Annotations were independently determined by at least two individuals and then all genes with discrepant annotations were reviewed by the annotation group. In some cases, two disease-system categories were assigned. For example, mutations in a number of enzyme-encoding genes produce human disease within a narrow pathophysiological area. Thus, both 'metabolic' and the pathophysiology system directly associated with disease would be selected.
Human:rat ortholog pair assignment and KA/KS determination
cDNA 'reference sequences' corresponding to the protein-coding entries in HGMD were mapped to NCBI build 31 of the human genome sequence using BLAT and an alignment identity lower threshold of 95%. HGMD disease entries were assigned Ensembl human gene predictions if the optimal mapping of their cDNA sequences overlapped at least one Ensembl gene exon. Of the 11,522 1:1 rat:human orthologs identified by Ensembl, 11,224 (97.4%) were identified as syntenic with human and are accepted with confidence. The human disease genes for which orthologs were not predicted by Ensembl were individually investigated further using BLAT and BLAST. Additional 1:1 orthology relationships were established or confirmed using rat genome, EST, cDNA and protein sequences on the basis of high amino-acid identity (mostly greater than 80%). Given that the median amino-acid identity among Ensembl's syntenic human-rat orthologs is 88%, we are confident that the ortholog assignments utilized in this study are accurate. KA/KS and KS were calculated using the yn00 algorithm implemented in PAML and pairwise alignments of human and rat orthologs, described elsewhere.
Ortholog assignment and conservation index determination
Potential orthologs of 1,180 disease genes were sought in the rat, mouse, fish, nematode, fly and yeast genomes using the INPARANOID program. A CI was calculated as the alignment score in bits divided by the alignment length. The number of potential orthologs and quantiles of CI were determined for each species. Although percentage sequence identity could also have been used for this purpose, CI has the advantage of accounting for conservative substitutions. The INPARANOID program utilized BLAST2 to generate alignments and employed the BLOSUM62 amino-acid substitution matrix. Species utilized in this study and their on-line sources are: Rattus norvegicus, Mus musculus, C. elegans, Drosophila melanogaster, Saccharomyces cerevisiae and Danio rerio.
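The CI definition given above (bit score divided by alignment length) can be sketched as follows. This is a minimal illustration that assumes the bit score and alignment length have already been parsed from the alignment output; the function names are hypothetical, and the quartile calculation uses the default method of Python's standard library rather than whatever quantile method was applied in the study.

```python
from statistics import quantiles


def conservation_index(bit_score, alignment_length):
    """Length-normalized similarity: alignment score in bits / alignment length."""
    if alignment_length <= 0:
        raise ValueError("alignment_length must be positive")
    return bit_score / alignment_length


def ci_quartiles(ci_values):
    """Return the three quartile cut points of a list of CI values.

    Uses statistics.quantiles with its default settings; this choice is an
    assumption of the sketch.
    """
    return quantiles(ci_values, n=4)
```

For example, a pair aligned over 90 residues with a score of 180 bits would have a CI of 2.0 under this definition.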
Human disease mutation and rat wild-type genome-sequence comparison
Human and rat ortholog alignments were inspected automatically at positions described by HGMD as human disease mutations. From this, 104 amino-acid sites were found to be identical between the rat sequence and the proposed disease variant in humans. These were investigated further by review of the literature and the relevant HGMD entry. Questionable items (marked as '?' in HGMD), and those for which there was no documented statistical evidence for a causal connection between sequence variation and a clinical phenotype, were excluded, as were entries associated with poor alignment quality.
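The automatic inspection described above can be sketched as follows, assuming a gapped pairwise protein alignment and a 1-based mutation position on the ungapped human sequence. The function name, the toy alignment and the coordinate convention are assumptions made for illustration; the original pipeline is not described at this level of detail.

```python
def rat_matches_disease_variant(human_aln, rat_aln, position, wild_type, variant):
    """Return True if the rat residue aligned to the given human position
    equals the human disease-associated variant residue.

    human_aln, rat_aln: gapped, equal-length alignment rows ('-' for gaps).
    position: 1-based position in the ungapped human protein (assumed convention).
    """
    if len(human_aln) != len(rat_aln):
        raise ValueError("alignment rows must have equal length")
    residue_index = 0
    for human_res, rat_res in zip(human_aln, rat_aln):
        if human_res == "-":
            continue  # gap in human: does not advance the human coordinate
        residue_index += 1
        if residue_index == position:
            if human_res != wild_type:
                raise ValueError("human residue does not match stated wild type")
            return rat_res == variant
    raise ValueError("position lies beyond the end of the human sequence")


# Toy alignment (hypothetical sequences).
human = "MK-TAYIA"
rat = "MKQSAYVA"

# Hypothetical I6V mutation: rat carries V at the aligned site.
same_as_variant = rat_matches_disease_variant(human, rat, 6, "I", "V")
# Hypothetical A4G mutation: rat carries the human wild-type A.
not_variant = rat_matches_disease_variant(human, rat, 4, "A", "G")
```

Sites flagged in this way are only candidates; as stated above, each one was then checked against the literature and the relevant HGMD entry.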
Text analysis of functional annotation by disease-system categories
Gene Ontology and domain terms were analyzed for over- or under-representation in different disease systems using the CoMet tool within the OmniViz analysis package. This analysis package analyzes associations between terms, categories, clusters and groups by examining the deviation of the number of occurrences in a cell from that found in a random distribution; the resulting analyses are then visualized with the CoMet tool, which facilitates an overview of correlations among a matrix of variables. Records were required to meet the following criteria: P value less than or equal to 0.05, expected frequency greater than or equal to 2%, observed frequency greater than or equal to 2%, and record number greater than or equal to 5. For this algorithm, the null hypothesis for a given GO entry would be: 'its frequency in a specific disease system is equal to its frequency in others'. Results are portrayed graphically with over-represented cells labeled in red. A gradient of color hue represents the deviation in value for each category.
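The four record criteria listed above can be expressed as a simple filter. This is a sketch only: it does not reproduce the CoMet statistic itself, the record fields are hypothetical, and frequencies are assumed to be stored as fractions (so 2% is 0.02).

```python
from dataclasses import dataclass


@dataclass
class TermRecord:
    term: str             # GO or domain term (hypothetical example data)
    p_value: float
    expected_freq: float  # fraction, e.g. 0.02 for 2%
    observed_freq: float  # fraction
    n_records: int


def meets_criteria(rec):
    """Apply the thresholds stated in the text to one record."""
    return (
        rec.p_value <= 0.05
        and rec.expected_freq >= 0.02
        and rec.observed_freq >= 0.02
        and rec.n_records >= 5
    )


records = [
    TermRecord("kinase activity", 0.01, 0.03, 0.08, 12),
    TermRecord("ion transport", 0.20, 0.05, 0.06, 30),  # fails on P value
    TermRecord("rare domain", 0.001, 0.01, 0.04, 4),    # fails on expected frequency and count
]
kept = [rec.term for rec in records if meets_criteria(rec)]
```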
Identification of repeat-variation disease gene set
Genes with disease-associated mutations characterized as bearing 'repeat variations' were retrieved from the HGMD database. The dataset included the nine different known CAG-expansion disease genes. Protein and cDNA sequences from human, rat and mouse were obtained from the Ensembl database. In all, 11,501 human-rat sequence pairs, 12,488 human-mouse sequence pairs and 12,357 rat-mouse sequence pairs were aligned using CLUSTALW, and glutamine tandem repeats were mapped onto the aligned sequences. The length cut-off for considering repeats was five or more glutamine residues in tandem. Conserved repeats between two species were those with a length of 5 or longer in an equivalent position in the two sequences.
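A minimal Python sketch of the repeat-mapping step is given below. It finds runs of five or more glutamines in each ungapped sequence, maps them back to alignment columns, and treats a run as conserved when a run in the other species overlaps it in alignment columns. Interpreting 'equivalent position' as column overlap is an assumption made for this example, and the sequences are invented.

```python
import re


def glutamine_runs(aligned_seq, min_run=5):
    """Return inclusive (first_col, last_col) alignment spans of Q runs
    that are at least min_run residues long in the ungapped sequence."""
    columns = [i for i, res in enumerate(aligned_seq) if res != "-"]
    ungapped = aligned_seq.replace("-", "")
    runs = []
    for match in re.finditer("Q{%d,}" % min_run, ungapped):
        runs.append((columns[match.start()], columns[match.end() - 1]))
    return runs


def conserved_runs(aln_a, aln_b, min_run=5):
    """Runs in aln_a that overlap, in alignment columns, a run in aln_b."""
    runs_b = glutamine_runs(aln_b, min_run)
    return [
        run_a
        for run_a in glutamine_runs(aln_a, min_run)
        if any(run_a[0] <= run_b[1] and run_b[0] <= run_a[1] for run_b in runs_b)
    ]


# Invented aligned protein fragments.
human = "MAQQQQQQQQLP--QQQQQS"
rodent = "MAQQQQQ---LPAAQQQ-QS"

human_runs = glutamine_runs(human)
rodent_runs = glutamine_runs(rodent)
shared = conserved_runs(human, rodent)
```

In this toy case the first human tract is conserved although it is shorter in the rodent sequence, whereas the second tract is not counted because the rodent counterpart has only four glutamines.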
Additional data files
The following additional data are available with the online version of this article: the original set of 104 genes where rat wild-type sequence is identical to human disease variant mutation (Additional data file 1), the pathophysiology annotations for human disease genes (Additional data file 2), the list of immune-system genes not identified as disease genes (Additional data file 3), the list of neurological-system genes not identified as disease genes (Additional data file 4) and the potential orthologs of human disease genes identified for each model organism (Additional data file 5).
Acknowledgements
C.P.P., E.E.W. and L.G. are funded by the Medical Research Council UK. H.H., H.W., K.G.W., H.X., K.F. and D.R.S. were funded under grants HG002046 and HG002145 from the National Institutes of Health, USA. M.M.A. acknowledges program Ramón y Cajal and grant BIO2002-04426-C02-01 from the Spanish Ministry of Science and Technology. P.D.S. and D.N.C. acknowledge the support of Celera Genomics, Rockville, MD. We thank the Rat Genome Sequencing Consortium for valuable advice and support during this project.
References
- Pauling L, Itano HA, Singer SJ, Wells IC: Sickle cell anemia, a molecular disease. Science. 1949, 110: 543-548.
- Ingram VM: Gene mutations in human hemoglobin: the chemical difference between normal and sickle cell hemoglobin. Nature. 1957, 180: 326-328.
- Stenson PD, Ball EV, Mort M, Phillips AD, Shiel JA, Thomas NS, Abeysinghe S, Krawczak M, Cooper DN: Human Gene Mutation Database (HGMD®): 2003 update. Hum Mutat. 2003, 21: 577-581. 10.1002/humu.10212.
- Hamosh A, Scott AF, Amberger J, Bocchini C, Valle D, McKusick VA: Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 2002, 30: 52-55. 10.1093/nar/30.1.52.
- Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, et al: Initial sequencing and analysis of the human genome. Nature. 2001, 409: 860-921. 10.1038/35057062.
- Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, et al: The sequence of the human genome. Science. 2001, 291: 1304-1351. 10.1126/science.1058040.
- Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, et al: Initial sequencing and comparative analysis of the mouse genome. Nature. 2002, 420: 520-562. 10.1038/nature01262.
- Gibbs RA, Weinstock GM, Metzker ML, Muzny DM, Sodergren EJ, Scherer S, Scott G, Steffen D, Worley KC, Burch PE, et al: Genome sequencing of the Brown Norway Rat yields insights into mammalian evolution. Nature. 2004, 428: 493-521. 10.1038/nature02426.
- Hurst LD: The Ka/Ks ratio: diagnosing the form of sequence evolution. Trends Genet. 2002, 18: 486. 10.1016/S0168-9525(02)02722-1.
- Hurst LD, Smith NGC: Do essential genes evolve slowly?. Curr Biol. 1999, 9: 747-750. 10.1016/S0960-9822(99)80334-0.
- Duret L, Mouchiroud D: Determinants of substitution rates in mammalian genes: expression pattern affects selection intensity but not mutation rate. Mol Biol Evol. 2000, 17: 68-74.
- Winter EE, Goodstadt L, Ponting CP: Elevated rates of protein secretion, evolution, and disease among tissue-specific genes. Genome Res. 2004, 14: 54-61. 10.1101/gr.1924004.
- Clamp M, Andrews D, Barker D, Bevan P, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, et al: Ensembl 2002: accommodating comparative genomics. Nucleic Acids Res. 2003, 31: 38-42. 10.1093/nar/gkg083.
- Kent WJ: BLAT-the BLAST-like alignment tool. Genome Res. 2002, 12: 656-664. 10.1101/gr.229202. Article published online before March 2002.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
- Tsutsumi K, Hagi A, Inoue Y: The relationship between plasma high density lipoprotein cholesterol levels and cholesteryl ester transfer protein activity in six species of healthy experimental animals. Biol Pharm Bull. 2001, 24: 579-581. 10.1248/bpb.24.579.
- Zhang A, Potvin G, Zaiman A, Chen W, Kumar R, Phillips L, Stanley P: The gain-of-function Chinese hamster ovary mutant LEC11B expresses one of two Chinese hamster FUT6 genes due to the loss of a negative regulatory factor. J Biol Chem. 1999, 274: 10439-10450. 10.1074/jbc.274.15.10439.
- Gersten KM, Natsuka S, Trinchera M, Petryniak B, Kelly RJ, Hiraiwa N, Jenkins NA, Gilbert DJ, Copeland NG, Lowe JB: Molecular cloning, expression, chromosomal assignment, and tissue-specific expression of a murine alpha-(1,3)-fucosyltransferase locus corresponding to the human ELAM-1 ligand fucosyl transferase. J Biol Chem. 1995, 270: 25047-25056. 10.1074/jbc.270.42.25047.
- Soussi-Yanicostas N, de Castro F, Julliard AK, Perfettini I, Chedotal A, Petit C: Anosmin-1, defective in the X-linked form of Kallman syndrome, promotes axonal branch formation from olfactory bulb output neurons. Cell. 2002, 109: 217-228. 10.1016/S0092-8674(02)00713-4.
- Rugarli EI, Di Schiavi E, Hilliard MA, Arbucci S, Ghezzi C, Facciolli A, Coppola G, Ballabio A, Bazzicalupo P: The Kallmann syndrome gene homolog in C. elegans is involved in epidermal morphogenesis and neurite branching. Development. 2002, 129: 1283-1294.
- Gao L, Zhang J: Why are some human disease-associated mutations fixed in mice?. Trends Genet. 2003, 19: 678-681. 10.1016/j.tig.2003.10.002.
- Smith NGC, Eyre-Walker A: Human disease genes: patterns and predictions. Gene. 2003, 318: 169-175. 10.1016/S0378-1119(03)00772-8.
- Jimenez-Sanchez G, Childs B, Valle D: Human disease genes. Nature. 2001, 409: 853-855. 10.1038/35057050.
- Hess ST, Blake JD, Blake RD: Wide variations in neighbor-dependent substitution rates. J Mol Biol. 1994, 236: 1022-1033. 10.1016/0022-2836(94)90009-4.
- Green P, Ewing B, Miller W, Thomas PJ, Green ED, NISC Comparative sequencing Program: Transcription-associated mutational asymmetry in mammalian evolution. Nat Genet. 2003, 33: 514-517. 10.1038/ng1103.
- Majewski J: Dependence of mutational asymmetry on gene-expression levels in the human genome. Am J Hum Genet. 2003, 73: 688-692. 10.1086/378134.
- Hardison R, Roskin KM, Yang S, Diekhans M, Kent WJ, Weber R, Elnitski L, Li J, O'Connor M, Kolbe D, et al: Covariation in frequencies of substitution, deletion, transposition, and recombination during eutherian evolution. Genome Res. 2003, 13: 13-26. 10.1101/gr.844103.
- Van Eerdewegh P, Little RD, Dupuis J, Del Mastro RD, Falls K, Simon J, Torrey D, Pandit S, McKenny J, Braunschweiger K, et al: Association of the ADAM33 gene with asthma and bronchial hyperresponsiveness. Nature. 2002, 418: 426-430. 10.1038/nature00878.
- BioKnowledge Library. [http://www.incyte.com/control/researchproducts/insilico/proteome]
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al: Gene ontology: tool for the unification of biology. Nat Genet. 2000, 25: 25-29. 10.1038/75556.
- Hariharan IK, Haber DA: Yeast, flies, worms and fish in the study of human disease. N Engl J Med. 2003, 348: 2457-2463. 10.1056/NEJMon023158.
- Rubin GM, Yandell MD, Wortman JR, Gabor Miklos GL, Nelson CR, Hariharan IK, Fortini ME, Li PW, Apweiler R, Fleischmann W, et al: Comparative genomics of the eukaryotes. Science. 2000, 287: 2204-2215. 10.1126/science.287.5461.2204.
- Fortini ME, Skupski MP, Boguski MS, Hariharan IK: A survey of human disease gene counterparts in the Drosophila genome. J Cell Biol. 2000, 150: F23-F30. 10.1083/jcb.150.2.F23.
- Albà MM, Guigó R: Comparative analysis of amino-acid repeats in rodents and humans. Genome Res. 2004, 14: 549-554. 10.1101/gr.1925704.
- Albà MM, Santibáñez-Koref MF, Hancock JM: Conservation of polyglutamine tract size between mouse and human depends on codon interruption. Mol Biol Evol. 1999, 16: 1641-1644.
- Klement IA, Skinner PJ, Kaytor MD, Yi H, Hersch SM, Clark HB, Zoghbi HY, Orr HT: Ataxin-1 nuclear localization and aggregation: role in poly-glutamine-induced disease in SCA1 transgenic mice. Cell. 1998, 95: 41-53. 10.1016/S0092-8674(00)81781-X.
- Reddy PH, Williams M, Charles V, Garrett L, Pike-Buchanan L, Whetsell WO, Miller G, Tagle DA: Behavioural abnormalities and selective neuronal loss in HD transgenic mice expressing mutated full-length HD cDNA. Nat Genet. 1998, 20: 198-202. 10.1038/2510.
- Van Horsten S, Schmitt I, Nguyen HP, Holzmann C, Schmidt T, Walther T, Bader M, Pabst R, Kobbe P, Krotova J, et al: Transgenic rat model of Huntington disease. Hum Mol Genet. 2003, 12: 617-624. 10.1093/hmg/12.6.617.
- NCBI build 31 of the human genome sequence (November 2002). [http://hgdownload.cse.ucsc.edu/goldenPath/14nov2002/bigZips/]
- Yang Z: PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci. 1997, 13: 555-556.
- Yang Z, Nielsen R: Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol Biol Evol. 2000, 17: 32-43.
- Remm M, Storm CE, Sonnhammer EL: Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol. 2001, 314: 1041-1052. 10.1006/jmbi.2000.5197.
- Rattus norvegicus. [http://hgdownload.cse.ucsc.edu/goldenPath/rnJan2003/bigZips/]
- Mus musculus. [http://hgdownload.cse.ucsc.edu/goldenPath/mmFeb2003/bigZips/]
- Caenorhabditis elegans. [ftp://ftp.wormbase.org/pub/wormbase/archive/wormpep98.tar.gz]
- Drosophila melanogaster. [ftp://ftp.ncbi.nih.gov/refseq/release/invertebrate]
- Saccharomyces cerevisiae. [ftp://ftp.ncbi.nih.gov/refseq/release/fungi]
- UniGene - Danio rerio. [ftp://ftp.ncbi.nih.gov/repository/UniGene/Dr.seq.uniq.gz]
- OmniViz. [http://www.omniviz.com]
- Thompson JD, Higgins DG, Gibson TJ: CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22: 4673-4680.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.