Table 1 Identification of exons on the genome

From: A draft annotation and overview of the human genome

| Category | Database | Total records | Percent placed (%) | Total unique exons | Exons in complete ORFs | Exons in partial ORFs | Exon length (bp) | ORF length (bp) | Putative genes (non-splicing singletons) | Protein homology (Pfam hits) | CpG islands |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Known genes | UTR-DB | 40,258 | 80 | 19,195 | 5,075 | 1,895 | 6,925,762 | 1,990,818 | 10,007 (426) | 5,701 (3,813) | 3,866 |
| | HTDB | 15,305 | 89 | 48,477 | 12,077 | 7,706 | 11,893,081 | 4,043,544 | 4,816 (148) | 2,938 (1,943) | 1,960 |
| Consensus transcripts | HINT | 87,125 | 77 | 103,817 | 47,055 | 15,061 | 23,381,024 | 10,144,988 | 20,357 (959) | 9,121 (6,453) | 7,557 |
| | EG | 62,064 | 80 | 13,085 | 5,389 | 1,904 | 4,562,954 | 1,873,723 | 4,800 (154) | 2,177 (1,679) | 2,462 |
| | THC | 84,837 | 81 | 38,806 | 15,463 | 6,671 | 12,406,081 | 5,078,661 | 8,604 (322) | 2,907 (2,026) | 3,983 |
| Transcripts | GenBank CDS | 110,222 | 81 | 41,917 | 31,626 | 1,452 | 5,303,064 | 4,299,272 | 2,634 (227) | 1,858 (1,607) | 1,178 |
| | dbEST Human | 2,154,995 | 73 | 273,881 | 147,819 | 17,694 | 32,288,385 | 14,975,758 | 20,073 (7,136) | 5,377 (3,745) | 11,807 |
| Rodent transcripts | MINT | 92,531 | 30 | 8,284 | 5,433 | 120 | 866,046 | 780,566 | 777 | 123 (56) | 486 |
| | RINT | 37,367 | 46 | 5,600 | 3,588 | 75 | 592,788 | 546,932 | 458 | 65 (32) | 255 |
| | EMBL | 43,488 | 28 | 5,819 | 4,108 | 59 | 724,630 | 655,993 | 202 | 68 (72) | 135 |
| Protein homology | SWISS-PROT | 86,593 | 38 | 27,526 | 12,072 | 1,163 | 9,858,797 | 7,784,205 | 1,648 | 1,648 (1,244) | 158 |
| | TrEMBL | 351,834 | 13 | 22,670 | 8,134 | 1,677 | 4,385,497 | 2,886,034 | 1,185 | 1,185 (654) | 92 |
| | PIR | 182,106 | 16 | 4,106 | 1,175 | 383 | 1,355,644 | 764,339 | 321 | 321 (132) | 20 |
| Total | | | | 613,183 | 299,014 | 55,860 | 114,543,753 | 55,824,833 | 75,982 (9,372) | 33,489 (23,008) | 33,959 |
  1. Exons were identified after vector screening using transcript, rodent, and protein databases. The definition of a record varies by database, while 'exons' refers to high-scoring segment pairs in BlastN comparisons to the genome (E < 10⁻¹⁵ and sequence identity >90%). Unique exons and all subsequent columns count only placements that remained possible after the preceding databases had been considered. Placement of rodent transcripts required evidence of splicing and sequence identity >80%. ORFs were identified with getorf [84], with a minimum reported size of 30 bp. Protein homology required BlastX E < 10⁻¹⁵. Pfam hits required a score >20 using hmmpfam [92]. Gene prediction programs are described in Table 2. CpG islands were identified with cpgreport [84] using standard criteria [45].
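
The placement thresholds in the footnote to Table 1 can be read as a simple filter over alignment hits. The following Python sketch is only an illustration under stated assumptions, not the authors' pipeline: the `Hit` record, its field names, and the `shows_splicing` flag are hypothetical, and it assumes that the E-value cutoff applies to rodent transcripts as well, which the footnote does not state explicitly.

```python
from dataclasses import dataclass

# Thresholds quoted in the Table 1 footnote.
MAX_EVALUE = 1e-15           # hits must have E strictly below this
MIN_IDENTITY_HUMAN = 90.0    # percent identity, human transcripts
MIN_IDENTITY_RODENT = 80.0   # percent identity, rodent transcripts


@dataclass
class Hit:
    """One high-scoring segment pair (hypothetical record layout)."""
    query_id: str
    percent_identity: float
    evalue: float
    is_rodent: bool = False
    shows_splicing: bool = False  # hypothetical flag: evidence of splicing


def passes_placement(hit: Hit) -> bool:
    """Return True if the hit meets the footnote's placement criteria."""
    # Assumption: the E-value cutoff is applied to every hit, rodent or not.
    if hit.evalue >= MAX_EVALUE:
        return False
    if hit.is_rodent:
        # Rodent transcripts: splicing evidence and identity > 80%.
        return hit.shows_splicing and hit.percent_identity > MIN_IDENTITY_RODENT
    # Human transcripts: identity > 90%.
    return hit.percent_identity > MIN_IDENTITY_HUMAN


if __name__ == "__main__":
    hits = [
        Hit("human_est_1", 95.0, 1e-20),
        Hit("human_est_2", 85.0, 1e-20),
        Hit("rodent_1", 85.0, 1e-20, is_rodent=True, shows_splicing=True),
        Hit("rodent_2", 85.0, 1e-20, is_rodent=True, shows_splicing=False),
    ]
    for h in hits:
        print(h.query_id, passes_placement(h))
```

Keeping the thresholds as named constants makes it easy to see how the looser rodent identity cutoff is offset by the additional splicing requirement.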