Table 1 Identification of exons on the genome

From: A draft annotation and overview of the human genome

| Category | Database | Total records | Percent placed (%) | Total unique exons | Exons in complete ORFs | Exons in partial ORFs | Exon length (bp) | ORF length (bp) | Putative genes (non-splicing singletons) | Protein homology (Pfam hits) | CpG islands |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Known genes | UTR-DB | 40,258 | 80 | 19,195 | 5,075 | 1,895 | 6,925,762 | 1,990,818 | 10,007 (426) | 5,701 (3,813) | 3,866 |
| | HTDB | 15,305 | 89 | 48,477 | 12,077 | 7,706 | 11,893,081 | 4,043,544 | 4,816 (148) | 2,938 (1,943) | 1,960 |
| Consensus transcripts | HINT | 87,125 | 77 | 103,817 | 47,055 | 15,061 | 23,381,024 | 10,144,988 | 20,357 (959) | 9,121 (6,453) | 7,557 |
| | EG | 62,064 | 80 | 13,085 | 5,389 | 1,904 | 4,562,954 | 1,873,723 | 4,800 (154) | 2,177 (1,679) | 2,462 |
| | THC | 84,837 | 81 | 38,806 | 15,463 | 6,671 | 12,406,081 | 5,078,661 | 8,604 (322) | 2,907 (2,026) | 3,983 |
| Transcripts | GenBank CDS | 110,222 | 81 | 41,917 | 31,626 | 1,452 | 5,303,064 | 4,299,272 | 2,634 (227) | 1,858 (1,607) | 1,178 |
| | dbEST Human | 2,154,995 | 73 | 273,881 | 147,819 | 17,694 | 32,288,385 | 14,975,758 | 20,073 (7,136) | 5,377 (3,745) | 11,807 |
| Rodent transcripts | MINT | 92,531 | 30 | 8,284 | 5,433 | 120 | 866,046 | 780,566 | 777 | 123 (56) | 486 |
| | RINT | 37,367 | 46 | 5,600 | 3,588 | 75 | 592,788 | 546,932 | 458 | 65 (32) | 255 |
| | EMBL | 43,488 | 28 | 5,819 | 4,108 | 59 | 724,630 | 655,993 | 202 | 68 (72) | 135 |
| Protein homology | SWISS-PROT | 86,593 | 38 | 27,526 | 12,072 | 1,163 | 9,858,797 | 7,784,205 | 1,648 | 1,648 (1,244) | 158 |
| | TrEMBL | 351,834 | 13 | 22,670 | 8,134 | 1,677 | 4,385,497 | 2,886,034 | 1,185 | 1,185 (654) | 92 |
| | PIR | 182,106 | 16 | 4,106 | 1,175 | 383 | 1,355,644 | 764,339 | 321 | 321 (132) | 20 |
| Total | | | | 613,183 | 299,014 | 55,860 | 114,543,753 | 55,824,833 | 75,982 (9,372) | 33,489 (23,008) | 33,959 |

  1. Exons were identified after vector screening using transcript, rodent, and protein databases. The definition of a record varies by database; 'exons' are high-scoring segment pairs from BlastN comparisons to the genome (E < 10^-15 and sequence identity >90%). Unique exons and all subsequent columns refer to placements that were possible after the preceding databases had been considered. Placement of rodent transcripts required evidence of splicing and sequence identity >80%. ORFs were identified with getorf [84], reporting a minimum size of 30 bp. Protein homology required BlastX E < 10^-15. Pfam hits required a score >20 using hmmpfam [92]. Gene prediction programs are described in Table 2. CpG islands were identified with cpgreport [84] using standard criteria [45].
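
The exon criteria in the footnote can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes the BlastN hits are available in the standard 12-column tab-separated BLAST output, with percent identity in column 3 and E-value in column 11. The names `Hsp`, `parse_tabular`, and `passes` are hypothetical helpers introduced here. The splicing evidence required for rodent transcripts is not modeled, and reusing the same E-value cutoff for rodent hits is an assumption.

```python
from typing import Iterable, Iterator, NamedTuple


class Hsp(NamedTuple):
    """One high-scoring segment pair (hypothetical container)."""
    query: str
    subject: str
    identity: float  # percent identity
    evalue: float


# Thresholds as stated in the table footnote.
MAX_EVALUE = 1e-15          # E < 10^-15
MIN_IDENTITY_HUMAN = 90.0   # sequence identity > 90%
MIN_IDENTITY_RODENT = 80.0  # sequence identity > 80% (rodent transcripts)


def parse_tabular(lines: Iterable[str]) -> Iterator[Hsp]:
    """Parse 12-column tab-separated BLAST lines (assumed input format)."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        yield Hsp(fields[0], fields[1], float(fields[2]), float(fields[10]))


def passes(hsp: Hsp, min_identity: float = MIN_IDENTITY_HUMAN) -> bool:
    """Apply the strict inequalities from the footnote to a single HSP."""
    return hsp.evalue < MAX_EVALUE and hsp.identity > min_identity
```

Passing `MIN_IDENTITY_RODENT` as `min_identity` applies the looser identity cutoff; any check for evidence of splicing would have to be added separately.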