Table 3 cDNA analysis

                                                           DGCr1   DGCr2   Total
Clones that encode complete ORFs
   ORFs identical to the Release 3 predicted proteins*     3,429   1,946   5,375
   ORFs with 1-2% differences to Release 3 proteins†         235     306     541
Total                                                      3,664   2,252   5,916
Clones known to be compromised‡
   Nucleotide discrepancies                                  485     829   1,314
   5' short                                                  618     150     768
   3' truncated                                               57      26      83
   Co-ligated inserts                                         23      54      77
   ORFs with fewer than 50 amino acids                        49      21      70
   Antisense transcripts                                      53      58     111
   Transposable elements                                      12       9      21
   Bacterial contaminants                                      2       4       6
Total                                                      1,299   1,151   2,450
Clones that may represent alternative transcripts§
   5' short with upstream in-frame stop codon                 32       4      36
   3' truncated with downstream in-frame stop codon           55      17      72
   Putative missed micro-exon in Release 3 annotation         23       7      30
Total                                                        110      28     138
Unclassified clones¶                                         257     160     417

Summary of analysis of the 8,770 clones in GenBank plus 151 clones for which we do not yet have accession numbers.
*The ORF predicted from the cDNA sequence is identical to the corresponding Release 3 predicted protein; 4,620 of these clones are from the LD, GH, HL, LP, RE or RH cDNA libraries, which were made from the same strain that was sequenced. Thus, we required their ORFs to be identical to those of the predicted Release 3 proteins. An additional 755 clones with ORFs identical to Release 3 proteins are from the AT, GM or SD libraries.
†The ORF predicted from the cDNA sequence is the same length as the Release 3 predicted protein, with less than 2% amino-acid difference. These clones are derived from the AT, GM or SD cDNA libraries, which were made from strains or cell lines that are not isogenic with the strain that was sequenced.
‡See text for explanation of the individual subclasses of compromised clones.
§These clones have structures that are inconsistent with the corresponding Release 3 predicted gene. The 5'-short and 3'-truncated clones may reflect alternative splice products or promoters or, perhaps more likely, incompletely processed primary transcripts with retained introns. Additional experimental work will be required to distinguish these possibilities. Those clones referred to as putative missed micro-exons in Release 3 annotations are cases in which the cDNA clone contains additional nucleotides, in a multiple of 3, relative to the Release 3 predicted mRNA, and maintains the ORF. We expect that most of these discrepancies result from a failure of Sim4 to align micro-exons and that these cases will be resolved by modifying the Release 3 gene model; see [15] for more discussion.
¶The predicted ORF from the cDNA clone does not match a Release 3 predicted protein, but the underlying cause could not be classified into one of the above categories. We expect that very few of these clones accurately reflect actual gene transcripts.
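
The subtotals in Table 3 and the overall clone count given in the footnote can be cross-checked with a few lines of arithmetic. The following is a minimal sketch in Python, not part of the original analysis: the counts are transcribed from the table, while the dictionary layout and the names `TABLE3`, `column_totals` and `grand_total` are illustrative assumptions.

```python
# Counts transcribed from Table 3 as (DGCr1, DGCr2) per subclass.
# The structure and all names here are illustrative, not from the paper.
TABLE3 = {
    "complete ORFs": {
        "identical to Release 3": (3429, 1946),
        "1-2% differences": (235, 306),
    },
    "compromised": {
        "nucleotide discrepancies": (485, 829),
        "5' short": (618, 150),
        "3' truncated": (57, 26),
        "co-ligated inserts": (23, 54),
        "ORF under 50 amino acids": (49, 21),
        "antisense transcripts": (53, 58),
        "transposable elements": (12, 9),
        "bacterial contaminants": (2, 4),
    },
    "possible alternative transcripts": {
        "5' short, upstream in-frame stop": (32, 4),
        "3' truncated, downstream in-frame stop": (55, 17),
        "putative missed micro-exon": (23, 7),
    },
    "unclassified": {
        "unclassified": (257, 160),
    },
}


def column_totals(rows):
    """Return (DGCr1 total, DGCr2 total, combined total) for one category."""
    dgcr1 = sum(first for first, _ in rows.values())
    dgcr2 = sum(second for _, second in rows.values())
    return dgcr1, dgcr2, dgcr1 + dgcr2


def grand_total(table):
    """Sum the combined totals of every category in the table."""
    return sum(column_totals(rows)[2] for rows in table.values())


for category, rows in TABLE3.items():
    print(category, column_totals(rows))
print("all clones:", grand_total(TABLE3))
```

Summing the four categories this way should reproduce the footnote's figure of 8,770 GenBank clones plus 151 clones without accession numbers.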