Prediction of unidentified human genes on the basis of sequence similarity to novel cDNAs from cynomolgus monkey brain
© Osada et al., licensee BioMed Central Ltd 2001
Received: 14 September 2001
Accepted: 7 November 2001
Published: 19 December 2001
The complete assignment of the protein-coding regions of the human genome is a major challenge for genome biology today. We have already isolated many hitherto unknown full-length cDNAs as orthologs of unidentified human genes from cDNA libraries of the cynomolgus monkey (Macaca fascicularis) brain (parietal lobe and cerebellum). In this study, we used cDNA libraries of three other parts of the brain (frontal lobe, temporal lobe and medulla oblongata) to isolate novel full-length cDNAs.
The entire sequences of novel cDNAs of the cynomolgus monkey were determined, and the orthologous human cDNA sequences were predicted from the human genome sequence. We predicted 29 novel human genes with putative coding regions sharing an open reading frame with the cynomolgus monkey, and we confirmed the expression of 21 pairs of genes by the reverse transcription-coupled polymerase chain reaction method. The hypothetical proteins were also functionally annotated by computer analysis.
The 29 new genes had not been discovered in recent explorations for novel genes in humans, and the ab initio method failed to predict all exons. Thus, monkey cDNA is a valuable resource for the preparation of a complete human gene catalog, which will facilitate post-genomic studies.
In 2001, it was announced that most of the human genome had been sequenced and that the complete sequence would be determined by 2003 [1,2]. As the first step in decoding the entire sequence of the human genome, we must identify the protein-coding regions in the human genome sequence. Predicting the genes in the human genome is one of the most substantial applications of the sequence data, but it is also the most difficult. Prediction of protein-coding genes by computer algorithm (ab initio prediction) is accurate to some extent in organisms whose genome sequence has already been determined, for example the fruit fly Drosophila melanogaster and the nematode Caenorhabditis elegans. Similar attempts to predict human genes, however, have met with limited success because small exons are generally separated by long introns. In the fly, ab initio methods can correctly predict around 90% of individual exons, and all the coding exons in a gene in about 40% of genes, in contrast to about 70% and 20%, respectively, in humans [4,5]. Thus, prediction of genes in humans requires further experimental evidence, for example, from cDNAs and expressed sequence tags (ESTs). However, the uneven expression of various transcripts makes it difficult to isolate the cDNA of genes expressed as a very small proportion of the total transcripts, and it has not been possible to completely eliminate artifacts arising from cDNAs and ESTs derived from genomic DNA or partially spliced mRNAs [7,8]. Moreover, EST sequences are less informative about the boundary of an individual gene. It is also difficult to predict single-exon genes encoding small proteins, because it is impossible to determine easily whether the short open reading frames (ORFs) are actually translated into protein. In our previous study, we isolated approximately 20,000 clones from oligo-capped cDNA libraries of parts of the cynomolgus monkey brain (parietal lobe and cerebellum) and sequenced their 5' ends.
We subsequently determined the entire sequence of 118 novel cDNAs, whose human orthologous cDNAs had not been entered in the public databases. The cynomolgus monkey (Macaca fascicularis) is a species of the genus Macaca, a group of Old World monkeys. On the basis of DNA sequence comparison complemented by fossil evidence, the divergence of humans and Old World monkeys is estimated at about 25 million years ago. In a previous study of a 10.9 kb globin genomic region, the nucleotide sequences of human and Macaca differed by 7%.
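A figure such as the 7% nucleotide difference quoted above is typically derived from a pairwise alignment. The following is a minimal sketch, assuming two sequences that have already been aligned to equal length with `-` marking gaps; the function name `percent_difference` and the choice to ignore gapped columns are illustrative assumptions, and the cited globin study may have treated insertions and deletions differently.

```python
def percent_difference(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned, ungapped positions at which two sequences differ.

    Assumption: both inputs come from the same pairwise alignment, so they
    have equal length and use '-' for gaps. Gapped columns are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # skip alignment gaps
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return 100.0 * mismatches / compared


if __name__ == "__main__":
    # Toy alignment: one mismatch in ten ungapped columns.
    print(percent_difference("ACGTACGTAC", "ACGTACGTAA"))
```

On real data the inputs would be the aligned human and macaque sequences of the region being compared.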
By using cynomolgus monkey brain, we were able to reduce the degradation of the mRNA, which is very fragile, during the construction of cDNA libraries, because, unlike human brain tissue, brain tissue can be removed intentionally from the anesthetized monkey and frozen immediately after extirpation. Since that study, we have isolated an additional 12,000 clones and determined the entire sequence of 673 novel cDNAs from frontal lobe, temporal lobe and medulla oblongata of the cynomolgus monkey brain. Moreover, comparison between the cynomolgus monkey cDNA sequence and the human genome draft sequence makes it possible to distinguish ORFs that actually encode proteins from spurious ORFs, because the high genomic conservation between human and Macaca means that protein-coding ORFs are likely to be maintained between the human and monkey sequences. In this study, we have predicted unidentified human genes from the human genome draft sequence by referring to cDNA sequences of cynomolgus monkey and have experimentally confirmed the expression of most of these genes. This method allowed us to identify novel human genes that had eluded other recent exploratory studies.
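The idea that a genuine protein-coding ORF should be maintained in both species can be illustrated with a short sketch. It assumes forward-strand cDNA sequences, the standard stop codons, and a hypothetical length-ratio threshold (`min_ratio`); none of these details is taken from the study, which compared monkey cDNAs against the human genome draft rather than against ready-made human cDNAs.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str) -> str:
    """Return the longest ATG-to-stop ORF on the forward strand (stop included)."""
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None  # look for the next ORF in this frame
    return best


def orf_is_maintained(monkey_cdna: str, human_cdna: str,
                      min_ratio: float = 0.9) -> bool:
    """True if both sequences carry an ORF of similar length.

    min_ratio is an illustrative threshold, not a value from the paper.
    """
    monkey_orf = longest_orf(monkey_cdna)
    human_orf = longest_orf(human_cdna)
    if not monkey_orf or not human_orf:
        return False
    shorter, longer = sorted((len(monkey_orf), len(human_orf)))
    return shorter / longer >= min_ratio


if __name__ == "__main__":
    monkey = "GGATGAAACCCGGGTTTTAAGG"
    human_ok = "GGATGAAACCAGGGTTTTAAGG"       # synonymous change, ORF intact
    human_broken = "GGATGAAATAGGGGTTTTAAGG"   # premature stop codon
    print(orf_is_maintained(monkey, human_ok))
    print(orf_is_maintained(monkey, human_broken))
```

A spurious ORF would be expected to accumulate stop codons or frameshifts in one lineage, which is what the second toy case imitates.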
Results and discussion
In our previous study, we constructed two oligo-capped cDNA libraries of the cynomolgus monkey brain (cerebellum cortex, QccE; parietal lobe, QnpA). Subsequently, we constructed three more brain libraries from the frontal lobe (QflA), temporal lobe (QtrA) and medulla oblongata (QmoA), and sequenced the 5' ends of approximately 12,000 clones isolated from the three libraries. We then determined the entire sequence of 673 clones whose 5'-end sequences showed no significant similarities to sequences in the GenBank nr or EST databases and deposited them in the DDBJ/EMBL/GenBank nucleotide database (accession numbers: AB055250-055381, AB056322-056432, AB056799-056847, AB060202-060263, AB062934-063100, AB066511-066549). From these clones, we selected 90 that carried a putative coding region longer than 300 bp and showed no homology to mRNA sequences in the GenBank database by BLAST search (cut-off value: 1e-90), except for minimal overlap (less than 30% coverage of the ORF). The 90 novel cDNAs are listed on our website. Next, we tried to identify the regions of the human genome sequence (1 April 2001 data) that corresponded to the novel cynomolgus monkey cDNAs by using the program BLAT at the University of California at Santa Cruz (UCSC) website. The search yielded 78 cynomolgus monkey clones with orthologous sequences in the human genome (more than 90% similarity), but 14 of the 78 clones showed only partial matches with the genome sequence, making it impossible to obtain the entire sequence of their human orthologs. The remaining 12 clones, which did not match any sequence in the human genome, probably correspond to genomic regions missing from the current draft sequence.
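The selection criteria described above (ORF longer than 300 bp; no BLAST hit at the 1e-90 cut-off other than one covering less than 30% of the ORF) can be expressed as a small filter. This is a sketch under stated assumptions: the `BlastHit` record and the function `is_novel_candidate` are hypothetical names, and treating a hit with an E-value exactly at the cut-off as significant is a guess about how the threshold was applied.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class BlastHit:
    evalue: float         # E-value reported for the hit
    orf_overlap_bp: int   # number of ORF bases covered by the hit


def is_novel_candidate(orf_length_bp: int,
                       hits: Iterable[BlastHit],
                       min_orf_bp: int = 300,
                       evalue_cutoff: float = 1e-90,
                       max_coverage: float = 0.30) -> bool:
    """Apply the ORF-length and BLAST-novelty criteria to one clone."""
    if orf_length_bp <= min_orf_bp:
        return False  # coding region must be longer than 300 bp
    for hit in hits:
        significant = hit.evalue <= evalue_cutoff
        coverage = hit.orf_overlap_bp / orf_length_bp
        # A significant hit is tolerated only if it covers < 30% of the ORF.
        if significant and coverage >= max_coverage:
            return False
    return True


if __name__ == "__main__":
    print(is_novel_candidate(450, []))                        # no hits at all
    print(is_novel_candidate(450, [BlastHit(1e-120, 100)]))   # minimal overlap
    print(is_novel_candidate(450, [BlastHit(1e-120, 200)]))   # known mRNA
```

Keeping the thresholds as keyword arguments makes it easy to see how the candidate set would change under a looser or stricter cut-off.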
Table: Summary of the 29 new genes
Columns: Monkey accession number | Cynomolgus monkey (Macaca) clone | Length (amino acids) in Macaca | Length (amino acids) in human | Number of exons | Functional annotation (putative)
Putative functional annotations listed: Leucine-rich repeat protein; Serine/threonine protein kinase; Glycoside hydrolase family 31; Cysteine-rich protein with leucine-rich repeat; G-protein signaling protein; Zinc-finger protein with KRAB domain; RNA-binding methyl transferase
Interestingly, two genes showed a clearly different pattern of expression in humans and the cynomolgus monkey. Using the first primer set designed, the cDNAs of QtrA-10522 and QtrA-13256 appeared to be expressed only in monkey brain. However, with a primer set covering another pair of exons, expression in both human and cynomolgus monkey was observed for both clones. These observations indicate some differences between humans and the cynomolgus monkey in the splicing pattern of these genes. The cynomolgus monkey cDNA for both genes carried additional exons that were spliced out as introns in humans.
We could not amplify the expected RT-PCR products for the ORFs of eight genes. However, the conservation of the ORFs between humans and the cynomolgus monkey strongly supports their being functional regions. Therefore, it is possible that the eight genes could not be amplified by RT-PCR because of technical problems, such as high GC content or inhibitory structure in the targeted region, or because the amount of transcript was too small to detect in the total RNA that we used for RT-PCR.
We have constructed 29 hypothetical human cDNAs on the basis of sequence similarity to novel cynomolgus monkey cDNAs, and comparisons between human and monkey sequences allowed us to select the cDNAs carrying protein-coding ORFs with high accuracy. These novel genes had not been discovered by recent explorations for novel genes in humans, and the ab initio method failed to predict all of their exons. They were also functionally annotated by computer analysis. Thus, the cDNA of a closely related monkey is a valuable resource for preparing a complete human gene catalog, which will facilitate future post-genomic studies, such as DNA microarray or proteome analysis.
Materials and methods
Cynomolgus monkey tissues
Tissue was collected from a 21-year-old male cynomolgus monkey. This monkey was cared for and handled according to guidelines established by the Institutional Animal Care and Use Committee of the National Institute of Infectious Diseases (NIID) of Japan and the standard operating procedures for monkeys at the Tsukuba Primate Center, NIID, Tsukuba, Ibaraki, Japan. Extirpation of the tissues was conducted in accordance with all guidelines required in the Laboratory Biosafety Manual, World Health Organization, and was carried out in the P3 facility for monkeys at the Tsukuba Primate Center, NIID.
Construction of oligo-capped cDNA libraries
Total RNA was isolated using a commercially available RNA isolation kit (Isogen, Nippon Gene; RNeasy, QIAGEN). Poly(A)+ RNA was purified using oligo-dT cellulose (Collaborative Biomedical Products; Roche). Oligo-capping was carried out as previously described. After PCR amplification, the separated products longer than 2 kb were cloned into DraIII-digested pME18S-FL3, and the plasmid vectors containing cDNA were used to transform competent cells by electroporation.
Reverse transcription-coupled polymerase chain reaction (RT-PCR)
The templates of the human brain total mRNA were purchased from Clontech. Total RNA of cynomolgus monkey brain was isolated from the cerebrum of the cynomolgus monkey described above with Trizol (Life Technologies). A 1 μl volume of total mRNA was amplified with a One Step RNA PCR Kit (Takara). The temperature and time schedule were 40 cycles of 94°C for 30 sec, 58°C for 30 sec and 72°C for 90 sec. PCR products were separated on 1.5% agarose gel with a 100 bp ladder DNA marker (Gibco BRL).
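The cycling schedule above can be written down as data to estimate the time spent at the programmed temperatures. This is only a sketch: the step names (denaturation, annealing, extension) are the conventional reading of the three temperatures rather than labels used in the text, and the estimate leaves out the reverse-transcription step and ramp times, which are not given here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Step:
    name: str      # conventional label, assumed for illustration
    temp_c: int    # hold temperature in degrees Celsius
    seconds: int   # hold time in seconds


# One cycle as described: 94 C for 30 sec, 58 C for 30 sec, 72 C for 90 sec.
CYCLE = (
    Step("denaturation", 94, 30),
    Step("annealing", 58, 30),
    Step("extension", 72, 90),
)
N_CYCLES = 40


def cycling_seconds(cycle: Sequence[Step], n_cycles: int) -> int:
    """Total hold time over all cycles, ignoring ramping between steps."""
    return n_cycles * sum(step.seconds for step in cycle)


if __name__ == "__main__":
    total = cycling_seconds(CYCLE, N_CYCLES)
    print(f"{total} s = {total / 60:.0f} min of programmed hold time")
```

With these values the hold time is 40 x 150 s, or 100 minutes, before any ramping or reverse transcription is added.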
The entire sequence of each clone was determined on ABI 3700 and 310 automated sequencers (Perkin-Elmer) by the primer walking method. Cycle sequencing was carried out using an ABI PRISM BigDye Terminator Sequencing kit (Perkin-Elmer) according to the manufacturer's instructions.
This study was supported in part by the Health Science Research Grant for the Human Genome Program from the Ministry of Health and Welfare of Japan.
- International Human Genome Sequencing Consortium: Initial sequencing and analysis of the human genome. Nature. 2001, 409: 860-921. 10.1038/35057062.
- Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, et al: The sequence of the human genome. Science. 2001, 291: 1304-1351. 10.1126/science.1058040.
- Reese MG, Kulp D, Tammana H, Haussler D: Genie - gene finding in Drosophila melanogaster. Genome Res. 2000, 10: 529-538. 10.1101/gr.10.4.529.
- Dunham I, Shimizu N, Roe BA, Chissoe S, Hunt AR, Collins JE, Bruskiewich R, Beare DM, Clamp M, Smink LJ, et al: The DNA sequence of human chromosome 22. Nature. 1999, 402: 489-495. 10.1038/990031.
- Guigo R, Agarwal P, Abril JF, Burset M, Fickett JW: An assessment of gene prediction accuracy in large DNA sequences. Genome Res. 2000, 10: 1631-1642. 10.1101/gr.122800.
- Wiemann S, Weil B, Wellenreuther R, Gassenhuber J, Glassl S, Ansorge W, Bocher M, Blocker H, Bauersachs S, Blum H, et al: Toward a catalog of human genes and proteins: sequencing and analysis of 500 novel complete protein coding human cDNAs. Genome Res. 2001, 11: 422-435. 10.1101/gr.GR1547R.
- Hillier LD, Lennon G, Becker M, Bonaldo MF, Chiapelli B, Chissoe S, Dietrich N, DuBuque T, Favello A, Gish W, et al: Generation and analysis of 280,000 human expressed sequence tags. Genome Res. 1996, 6: 807-828.
- Wolfsberg TG, Landsman D: A comparison of expressed sequence tags (ESTs) to human genomic sequences. Nucleic Acids Res. 1997, 25: 1626-1632. 10.1093/nar/25.8.1626.
- Osada N, Hida M, Kusuda J, Tanuma R, Iseki K, Hirata M, Suto Y, Hirai M, Terao K, Suzuki Y, et al: Assignment of 118 novel cDNAs of cynomolgus monkey brain to human chromosomes. Gene. 2001, 275: 31-37. 10.1016/S0378-1119(01)00665-5.
- Goodman M, Porter CA, Czelusniak J, Page SL, Schneider H, Shoshani J, Gunnell G, Groves CP: Toward a phylogenetic classification of Primates based on DNA evidence complemented by fossil evidence. Mol Phylogenet Evol. 1998, 9: 585-598. 10.1006/mpev.1998.0495.
- Goodman M, Tagle DA, Fitch DHA, Bailey W, Czelusniak J, Koop BF, Benson P, Slightom JL: Primate evolution at the DNA level and a classification of hominoids. J Mol Evol. 1990, 30: 260-266.
- Prediction of unidentified human genes based on sequence similarity to novel cDNAs of cynomolgus monkey brain. [http://www.nih.go.jp/yoken/genebank/Supplementary_data/prediction/index.html]
- Human Genome Project Working Draft at UCSC. [http://genome.ucsc.edu/]
- Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W: A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 1998, 8: 967-974.
- Melo JV, Yan XH, Diamond J, Lin F, Cross NCP, Goldman JM: Reverse transcription/polymerase chain reaction (RT/PCR) amplification of very small numbers of transcripts: the risk in misinterpreting negative result. Leukemia. 1996, 10: 1217-1221.
- Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA. J Mol Biol. 1997, 268: 78-94. 10.1006/jmbi.1997.0951.
- GeneMark. [http://opal.biology.gatech.edu/GeneMark/]
- Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, Biswas M, Bucher P, Cerutti L, Corpet F, Croning MD, et al: The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res. 2001, 29: 37-40. 10.1093/nar/29.1.37.
- Suzuki Y, Ishihara D, Sasaki M, Nakagawa H, Hata H, Tsunoda T, Watanabe M, Komatsu T, Ota T, Isogai T, et al: Statistical analysis of the 5' untranslated region of human mRNA using "Oligo-Capped" cDNA libraries. Genomics. 2000, 64: 286-297. 10.1006/geno.2000.6076.