A functional update of the Escherichia coli K-12 genome
© Serres et al., licensee BioMed Central Ltd 2001
Received: 2 April 2001
Accepted: 10 July 2001
Published: 20 August 2001
Since the genome of Escherichia coli K-12 was initially annotated in 1997, additional functional information based on biological characterization and functions of sequence-similar proteins has become available. On the basis of this new information, an updated version of the annotated chromosome has been generated.
The E. coli K-12 chromosome is currently represented by 4,401 genes encoding 116 RNAs and 4,285 proteins. The boundaries of the genes identified in the GenBank Accession U00096 were used. Some protein-coding sequences are compound and encode multimodular proteins. The coding sequences (CDSs) are represented by modules (protein elements of at least 100 amino acids with biological activity and independent evolutionary history). There are 4,616 identified modules in the 4,285 proteins. Of these, 48.9% have been characterized, 29.5% have an imputed function, 2.1% have a phenotype and 19.5% have no function assignment. Only 7% of the modules appear unique to E. coli, and this number is expected to be reduced as more genome data becomes available. The imputed functions were assigned on the basis of manual evaluation of functions predicted by BLAST and DARWIN analyses and by the MAGPIE genome annotation system.
Much knowledge has been gained about functions encoded by the E. coli K-12 genome since the 1997 annotation was published. The data presented here should be useful for analysis of E. coli gene products as well as gene products encoded by other genomes.
The field of genomics has been expanding at a rapid pace since the annotated Escherichia coli K-12 genome was published in 1997, with the current number of published genomes exceeding 66 and with another 364 on their way according to the Genomes OnLine Database (GOLD). Deciphering the functions encoded by all gene products of the genomes is the next big challenge in the field. Function attributions through experimental, biochemical and genetic analyses and through bioinformatic studies are continuing, and microarray technology is shedding additional light on the functions associated with the gene products of the organism in question. The wealth of biological information on E. coli is still increasing and is contributing to a better understanding of this organism as well as of functions encoded in other organisms. It is therefore important that the most up-to-date information on E. coli gene products is available and used by researchers.
Several databases have been assembled for various areas of knowledge about the E. coli genome [4,5,6,7,8,9]. Each compilation has a different emphasis and collects different sets of information related to the function of the gene products. In the GenProtEC database, we have been curating information on physiological function and modular construction of gene products. Other databases most closely related to ours include EcoCyc, with emphasis on metabolic pathways, the CGSC database, with information on the genotypes and phenotypes of mutant strains, and EcoGene, which includes information on gene reconstructions, alternative gene boundaries and verified amino-terminal amino-acid sequences of the mature proteins. The E. coli genome project at the University of Wisconsin-Madison presents genome data on E. coli K-12 and pathogenic enterobacteria.
We present a functional update for E. coli K-12 gene products that incorporates information from the literature and referenced databases obtained since the 1997 GenBank deposit. Our focus has been the biological function of the gene products. Coding sequences (CDSs) encoding proteins whose function was previously imputed or not known were re-evaluated, and putative functions were assigned by manually evaluating the results from BLAST and DARWIN (data analysis and retrieval with indexed nucleotide/peptide sequences) analyses. The MAGPIE (multipurpose automated genome project investigation environment) genome annotation system was also applied. MAGPIE detected alternative boundaries for some of the open reading frames (ORFs).
Number of genes in the E. coli K-12 genome
For the initial annotation of the E. coli K-12 genome, 4,404 genes were identified with Blattner numbers (Bnums). Among the genes, 4,288 were believed to encode proteins and 116 to encode RNAs. Since then six Bnums have been retired: b0322, b0395, b0663, b0667, b0669 and b0671 (G. Plunkett, personal communication). In addition, three new genes have been identified and assigned Bnums. These include the protein-coding b4406 (yaeP, SWISS-PROT P52099) and b4407 (thiS, SWISS-PROT O32583) and the RNA-encoding b4408. The current number of E. coli genes is 4,401, with 4,285 encoding proteins and 116 encoding RNAs.
MAGPIE identified 5,527 candidate CDSs, which were assigned MAGPIE identifiers (Magnums) (see MAGPIE for details). The 4,285 CDSs identified by Bnums were also identified with Magnums. Variations in either the start or the stop position were detected for 1,077 of these CDSs, resulting in differences in the encoded proteins ranging from 1 to 147 amino acids, the latter in PtsA (Bnum b3947, Magnum ec_6103). The other Magnum-identified candidate CDSs include retired Bnums (six Magnums), CDSs located between the boundaries of Bnums (506 Magnums), and CDSs overlapping existing Bnums (730 Magnums). Among the Magnums located between the boundaries of Bnums are 21 CDSs that encode proteins of 80 or more amino acids. One of the CDSs located between Bnum boundaries (Magnum ec_2510) lies between b1624 and b1625 and encodes a protein of 66 amino acids. The carboxy-terminal 41 amino acids of this CDS are identical to the amino-acid sequence of the recently characterized beta-lactam resistance protein Blr (SWISS-PROT P56976) located at the same position. Other Magnums located between Bnum boundaries may correspond to short E. coli proteins.
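The categories used above (identical to a Bnum, boundary variant of a Bnum, overlapping a Bnum, or lying between Bnums) can be illustrated with a small interval comparison. The following is only a sketch under stated assumptions, not the procedure used by MAGPIE: the `CDS` record, the `classify` function, the category labels and all coordinates are hypothetical, and a candidate is treated as a boundary variant when it is on the same strand as a reference gene and shares exactly one end coordinate with it.

```python
from typing import List, NamedTuple


class CDS(NamedTuple):
    """Hypothetical record: 1-based inclusive coordinates with start < end."""
    name: str
    start: int
    end: int
    strand: str  # '+' or '-'


def classify(candidate: CDS, reference: List[CDS]) -> str:
    """Place a candidate CDS into one of four illustrative categories."""
    overlaps_any = False
    for ref in reference:
        # Two closed intervals overlap when each starts before the other ends.
        if candidate.start <= ref.end and ref.start <= candidate.end:
            if candidate.strand == ref.strand:
                if (candidate.start, candidate.end) == (ref.start, ref.end):
                    return "identical"
                # Assumption: one shared end on the same strand means the
                # same gene was called with a different start or stop.
                if candidate.start == ref.start or candidate.end == ref.end:
                    return "boundary_variant"
            overlaps_any = True
    return "overlapping" if overlaps_any else "between_genes"


# Made-up reference genes and candidates, for illustration only.
reference_genes = [CDS("geneA", 100, 300, "+"), CDS("geneB", 500, 800, "-")]
labels = [
    classify(CDS("cand1", 100, 300, "+"), reference_genes),
    classify(CDS("cand2", 70, 300, "+"), reference_genes),
    classify(CDS("cand3", 350, 450, "+"), reference_genes),
    classify(CDS("cand4", 250, 550, "+"), reference_genes),
]
```

A real comparison would also need to handle reading frame and circular coordinates, which this sketch leaves out.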
Functional annotation of E. coli K-12 gene products
The functional assignments of the E. coli gene products in the November 1997 GenBank U00096 deposit represented an accumulation of information retrieved from the literature (collected in the GenProtEC and EcoCyc databases) as well as imputed functions based on similarity of a known protein to the translated sequences. Since the deposit to GenBank was made, our database GenProtEC has continually been updated with knowledge on E. coli gene products appearing in the literature [3,13]. Information on transcriptional regulators has been incorporated from the work of J. Collado-Vides [14,15], and transport protein information has been adapted from the work of M.H. Saier and I.T. Paulsen [16,17]. GenProtEC also contains imputed function assignments based on sequence similarity to orthologous or paralogous proteins, on gene (operon) location and on phenotypes of mutants.
Gene products whose functions were known were not considered further for the functional update. The remaining 2,294 CDSs whose gene products had a putative or unknown function assignment were analyzed using BLAST and DARWIN. BLAST analyses were carried out for both the Bnum- and the Magnum-derived protein sequences. The results for the Bnum-derived protein sequences and the automatic functions predicted by MAGPIE or HERON (human-emulated reasoning for objective notations) were manually evaluated and imputed functions were assigned. Although the manual annotation step could not compete with the speed of the automatic annotation process of HERON, it provided us with more useful function descriptions. A comparison of the manually assigned putative functions with the HERON predicted functions showed that when leaving aside issues of specificity, a nearly equivalent function was predicted in 46% of the cases, whereas in 52% of the cases less information was obtained with HERON.
After the function update of the 2,294 CDSs, 1,306 gene products were assigned a putative function and 126 gene products were described by a phenotype. The remaining gene products were given one of the following three assignments: 'conserved protein', where sequence-similar matches were found but the function could not be determined in the absence of consistent functions reported for the matching sequences; 'conserved hypothetical protein', where sequence-similar matches existed but these had no associated function; 'unknown CDS', where the translated sequence had no known sequence match outside E. coli. The current function description includes 256 conserved proteins, 282 conserved hypothetical proteins and 324 unknown CDSs. The 862 gene products with no function assignment represent 19.6% of the E. coli chromosomal genes, and the unknown CDSs at this time represent 7.4% of E. coli genes.
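The percentages quoted above can be reproduced from the counts in the text. The short check below is only a worked-arithmetic sketch; the variable and function names are ours, and it assumes that the denominator is the full set of 4,401 genes (protein- and RNA-encoding), which is the choice that reproduces the quoted 19.6% and 7.4%.

```python
# Counts taken from the text; names are illustrative only.
TOTAL_GENES = 4401        # current Bnums, protein- and RNA-encoding
REEVALUATED_CDSS = 2294   # CDSs with putative or unknown function before the update
PUTATIVE = 1306
PHENOTYPE = 126
NO_ASSIGNMENT = {
    "conserved protein": 256,
    "conserved hypothetical protein": 282,
    "unknown CDS": 324,
}


def percent_of_genes(count: int, total: int = TOTAL_GENES) -> float:
    """Percentage of all genes, rounded to one decimal place."""
    return round(100 * count / total, 1)


unassigned = sum(NO_ASSIGNMENT.values())
# The three outcome classes should account for every re-evaluated CDS.
accounted_for = PUTATIVE + PHENOTYPE + unassigned
pct_unassigned = percent_of_genes(unassigned)
pct_unknown = percent_of_genes(NO_ASSIGNMENT["unknown CDS"])
```

With 4,285 protein-coding genes as the denominator instead, the unassigned fraction would be slightly above 20%, so the choice of denominator matters when comparing these figures with other annotations.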
A sample of annotated E. coli K-12 genes
Gene product type*
L-carnitine dehydratase, NAD(P)-binding
Putative acyl-CoA dehydrogenase
Na+/H+ antiporter, NhaA family
Putative betaine/carnitine/choline transport protein, BCCT family
Transcriptional regulator of arabinose catabolism, AraC/XylS family
Putative transcriptional regulator of leucine biosynthesis, LysR family
Outer membrane protease, receptor for phage OX2
Putative membrane protein
Protein chain elongation factor EF-Ts
Putative peptide chain release factor
30S ribosomal subunit protein S20
Putative fimbrial-like protein
Putative electron transfer flavoprotein
Organic solvent tolerance
thr operon leader peptide
Conserved hypothetical protein
Many changes are evident when comparing the updated annotation to that of 1997. The number of CDSs without function assignment has been reduced from 1,354 to 862. This reduction is due to functions being experimentally determined (77 CDSs), assignment of putative functions (367 CDSs), phenotype-associated functions (14 CDSs), and genes identified as belonging to phages (138 CDSs). In addition, inferred function assignments were withdrawn for 104 CDS-coded proteins whose functions remain unknown.
The number of gene products with putative function assignments has changed from 1,120 to 1,306. New functions were inferred for 473 CDSs. Putative function assignments were also removed as a result of new experimental data (175 CDSs), assignment of phenotype (8 CDSs) or reassessment of putative function assignments (104 CDSs).
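The changes between the 1997 and current annotations described in the two paragraphs above can be reconciled as simple bookkeeping. The following sketch uses only the counts given in the text; the helper `updated_count` and the dictionary labels are hypothetical names introduced for illustration.

```python
def updated_count(before: int, gained: dict, lost: dict) -> int:
    """Apply gains and losses to a starting count."""
    return before + sum(gained.values()) - sum(lost.values())


# CDSs without a function assignment: 1,354 in 1997.
no_assignment_now = updated_count(
    1354,
    gained={"inferred function withdrawn": 104},
    lost={
        "function experimentally determined": 77,
        "putative function assigned": 367,
        "phenotype-associated function": 14,
        "identified as phage genes": 138,
    },
)

# Gene products with a putative function: 1,120 in 1997.
putative_now = updated_count(
    1120,
    gained={"new function inferred": 473},
    lost={
        "replaced by experimental data": 175,
        "replaced by phenotype": 8,
        "putative assignment withdrawn on reassessment": 104,
    },
)
```

Both totals agree with the current figures quoted in the text (862 and 1,306), and the 104 withdrawn assignments appear once as a loss from the putative class and once as a gain to the unassigned class.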
Proteins as modular entities
Some of the proteins encoded in the E. coli genome have arisen through fusion of two or more genes. Examples of such gene fusions are the multifunctional enzymes Aas (2-acylglycerophospho-ethanolamine acyl transferase and acyl-acyl carrier protein synthetase) and GlmU (N-acetyl glucosamine-1-phosphate uridyltransferase and glucosamine-1-phosphate acetyl transferase) [19,20]. We have chosen to deal with proteins as modular entities where a module is defined as a protein element that has at least 100 amino-acid residues, carries a biological function and is presumed to have an independent evolutionary history. Most modules in E. coli are individual proteins. They can, however, also be part of a protein where multiple modules have been joined by gene fusion, as is the case for Aas and GlmU. Other protein types in E. coli such as transporters and regulators also involve gene fusion events. The current modular assignments are based on analysis of protein sequences within E. coli K-12 (P. Liang and M. Riley, unpublished data).
A sample of multimodular gene products of E. coli K-12
Gene product type*
Glycosyl transferase of penicillin-binding protein 1b (2nd module)
Transpeptidase of penicillin-binding protein 1b (3rd module)
PTS family enzyme IIC, n-acetylglucosamine-specific (1st module)
PTS family enzyme IIB, n-acetylglucosamine-specific (2nd module)
PTS family, enzyme IIA, n-acetylglucosamine-specific (3rd module)
ABC superfamily (membrane) cytochrome-related transporter (1st module)
ABC superfamily (atp_bind) cytochrome-related transporter (2nd module)
Acetaldehyde-CoA dehydrogenase (1st module)
Iron-dependent alcohol dehydrogenase (2nd module)
Putative transcriptional regulator, GntR family (1st module)
Putative ATP-binding component of a transport system (2nd module)
PTS family enzyme IIC, maltose and glucose-specific (1st module)
PTS family enzyme IIB, maltose and glucose-specific (2nd module)
Putative malic oxidoreductase (1st module)
Putative phosphate acetyl transferase (3rd module)
Transcriptional activator of hca cluster, LysR family (1st module)
Putative oxidoreductase (2nd module)
2-acylglycerophospho-ethanolamine acyl transferase (1st module)
Acyl-acyl carrier protein synthetase (2nd module)
Membrane-binding component of cell division protein (1st module)
GTPase component of cell division membrane protein (2nd module)
2-dehydro-3-deoxygalactonate 6-phosphate aldolase (1st module)
Galactonate dehydratase (2nd module)
N-acetyl glucosamine-1-phosphate uridyltransferase (1st module)
Glucosamine-1-phosphate acetyl transferase (2nd module)
3-hydroxybutyryl-coa epimerase; delta(3)-cis-delta(2)-trans-enoyl-coa-isomerase; enoyl-coa-hydratase (1st module)
3-hydroxyacyl-coa dehydrogenase (2nd module)
Phenotypic repressor of mal operon (2nd module)
ABC superfamily (atp_bind) maltose transport protein (1st module)
Gene products encoded by the E. coli K-12 chromosome
History of distribution of gene product types for E. coli K-12
An updated version of the function assignments for E. coli K-12 gene products has been presented using the genes identified in the GenBank U00096 deposit. Alternative gene boundaries were produced by MAGPIE. The MAGPIE genome annotation system also identified candidate CDSs that may represent gene products not identified in the GenBank U00096 deposit. Small ORFs with biological activity are likely to be abundant in the organism but await verification by biological data. Undoubtedly, the intergenic regions of E. coli K-12, as studied by Rudd and Bachellier et al., are also important for the function and regulation of gene products.
The percentage of identified chromosomal gene products without a function assignment is decreasing and is currently 19.6%. Only 7.4% of E. coli genes have no match in current sequence databases. This number will be further reduced with the release of the annotated genomes of Salmonella, Shigella and other closely related organisms. Preliminary data show that the number of unknown CDSs (ORFs encoding proteins without sequence-similar matches) will be less than 170 after data on the Salmonella typhimurium genome is included (M.H.S., unpublished data).
The function assignments presented here mainly represent the molecular functions of the gene products. With the generation of microarray data, gene products will also be characterized to a greater degree by the role they play in the cell under specific conditions. We have recently developed a classification system for cellular functions of E. coli K-12 gene products and have assigned more than one cellular role to some gene products where this is appropriate. There is also a need for a more uniform way of describing both the molecular and cellular roles of gene products among diverse organisms, and this issue is currently being addressed by the Gene Ontology Consortium.
We have presented a functional update of the gene products encoded by the genes of E. coli K-12 identified in the GenBank Accession U00096 deposit. The E. coli proteins were treated as modular entities where a module is at least 100 amino acids, carries a biological function, and has an independent evolutionary history. The functional update was performed by manual evaluation of the data obtained from GenProtEC, BLAST and DARWIN analyses, and MAGPIE annotation. A table containing the updated function assignments of E. coli K-12 gene products is available as an additional data file online, and at GenProtEC and MAGPIE. We believe these data will be valuable for analysis of E. coli K-12 itself as well as for the analysis of gene products encoded by other genomes.
Materials and methods
MAGPIE ORF prediction
A three-step approach to ORF prediction was taken to prepare the MAGPIE project for E. coli. GLIMMER 2.0 with a minimum ORF length of 80 nucleotides was initially used to create the base set of predictions. GLIMMER 2.0 was run with all default parameters, as recommended in the documentation, and trained on the annotated set of ORFs from the Blattner et al. release of 1997. Because GLIMMER selectively identifies ORFs that match a statistical model of a gene for the organism, it may miss genes that were laterally transferred or acquired more recently from other genomes. We therefore chose to combine the GLIMMER predictions with those of a syntactic tool encoded within MAGPIE. This tool identifies stop codons and then 'backtracks' to the farthest upstream acceptable in-frame start codon and defines this as the ORF. A non-redundant set of all GLIMMER ORFs plus syntactic ORFs between GLIMMER ORFs was generated. Finally, ORFs annotated by Blattner et al. that were not present in the non-redundant set were added to the MAGPIE project.
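The syntactic step can be illustrated with a short sketch. This is not the MAGPIE code: the function name, the set of start codons treated as 'acceptable' (ATG, GTG and TTG), the forward-strand-only scan and the reuse of 80 nucleotides as a default minimum length are all assumptions made for illustration. Scanning each frame forward and remembering the first start codon seen since the previous in-frame stop gives the same result as backtracking from each stop to the farthest upstream in-frame start.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG", "TTG"}  # assumed set of acceptable starts


def syntactic_orfs(seq: str, min_len: int = 80) -> List[Tuple[int, int]]:
    """Return forward-strand ORFs as 0-based, half-open (start, end) pairs.

    Each ORF runs from the farthest upstream in-frame start codon (with no
    intervening in-frame stop) through the stop codon itself.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        farthest_start = None  # first start seen since the last in-frame stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOP_CODONS:
                if farthest_start is not None and i + 3 - farthest_start >= min_len:
                    orfs.append((farthest_start, i + 3))
                farthest_start = None  # the next ORF cannot reach past this stop
            elif codon in START_CODONS and farthest_start is None:
                farthest_start = i
    return sorted(orfs)


# Toy sequence: two in-frame ATGs before a single TAA; the upstream one is kept.
toy = "CC" + "ATG" + "AAA" * 3 + "ATG" + "AAA" + "TAA" + "GG"
toy_orfs = syntactic_orfs(toy, min_len=9)
```

To cover both strands, the same function could be applied to the reverse complement of the sequence, with coordinates mapped back to the forward strand.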
The CDSs were compared to the NCBI nucleotide (nt) and non-redundant protein (nr) databases using gapped BLAST. Protein-sequence motifs were identified by PROSITE. A search against the MAGPIE-predicted proteins of over 40 completed genomes, including the previously annotated E. coli set, was also performed.
Automated function annotation was provided using HERON. Description lines with low information content (for example, descriptions containing words such as "hypothetical" or "putative") were filtered out. HERON then calculated word frequencies in the remaining descriptions, identified the top three most common words, and selected the description of the highest-scoring sequence match (for homology comparisons) with one or more high-frequency words. The selected description became the automated annotation for the coding region.
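The selection rule described above can be sketched in a few lines. This is an illustration of the described steps rather than the HERON implementation: the input format (a list of score and description pairs), the tokenization, the two-word low-information list and the function names are assumptions, and the example hits are invented.

```python
import re
from collections import Counter
from typing import List, Optional, Tuple

# Only the two example words from the text; a real list is presumably longer.
LOW_INFO_WORDS = {"hypothetical", "putative"}


def words(description: str) -> List[str]:
    """Lower-case alphanumeric tokens of a description line (assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", description.lower())


def heron_like_annotation(hits: List[Tuple[float, str]]) -> Optional[str]:
    """Pick the best-scoring informative description that uses a frequent word."""
    # Step 1: drop description lines with low information content.
    informative = [(score, desc) for score, desc in hits
                   if not LOW_INFO_WORDS.intersection(words(desc))]
    # Step 2: word frequencies over the remaining descriptions; keep the top three.
    counts = Counter(w for _, desc in informative for w in words(desc))
    top_words = {w for w, _ in counts.most_common(3)}
    # Step 3: highest-scoring match containing at least one high-frequency word.
    for _, desc in sorted(informative, key=lambda hit: hit[0], reverse=True):
        if top_words.intersection(words(desc)):
            return desc
    return None


# Invented similarity-search hits (score, description line).
example_hits = [
    (950.0, "putative transport protein"),
    (900.0, "YhbX"),
    (850.0, "acyl-CoA dehydrogenase"),
    (700.0, "acyl-CoA dehydrogenase, short chain"),
    (650.0, "probable acyl-CoA dehydrogenase FadE"),
    (300.0, "hypothetical protein"),
]
chosen = heron_like_annotation(example_hits)
```

In this example the top hit is discarded as low-information and the second hit contains none of the frequent words, so the third description is selected. A rule of this kind favours consensus among matches, which may be one reason an automated description can carry less specific information than a manually curated one.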
The protein sequences collected from GenBank Accession U00096 were compared to the nr database using gapped BLAST.
DARWIN (version 2.0) was used to detect sequence-similar proteins within E. coli K-12 and in 20 additional microbial genomes (P. Liang and M. Riley, unpublished data). In addition to orthologous matches, groups of paralogous proteins of E. coli K-12 were generated on the basis of the DARWIN results. In our hands, DARWIN is particularly successful in identifying distant sequence similarities, a consequence no doubt of the application of multiple substitution matrices optimized for the organism and for each sequence pair.
Functions were assigned to gene products on the basis of a manual evaluation of the results from the BLAST and DARWIN analyses. The automatic function prediction was also taken into account. In addition to incorporating recent experimental information, a substantial amount of human judgment was brought to bear.
Additional data files
A complete table of the current 4,401 Bnums is provided as an Excel file.
This work was supported by NIH grant R01 RR07861, the NASA Astrobiology Institute grant NCC2-1054, grants from the Edward Mallinckrodt, Jr Foundation and the Sinsheimer Foundation, and NSF grants NSF DBI-9984882 and NSF IIS-9996304. We thank Alastair Kerr for help on data retrieval and Edward A. Adelberg for help on monitoring the E. coli literature. We also thank Guy Plunkett 3rd for information on Blattner number status, Mark Schroeder for assistance with the MAGPIE analysis, and Peter Karp and Stefan Bekiranov for suggestions regarding the design and implementation of HERON.
- Blattner FR, Plunkett G, Bloch CA, Perna NT, Burland V, Riley M, Collado-Vides J, Glasner JD, Rode CK, Mayhew GF, et al: The complete genome sequence of Escherichia coli K-12. Science. 1997, 277: 1453-1474. 10.1126/science.277.5331.1453.
- GOLD: Genomes OnLine Database homepage. [http://igweb.integratedgenomics.com/GOLD/]
- Riley M, Serres MH: Interim report on genomics of Escherichia coli. Annu Rev Microbiol. 2000, 54: 341-411. 10.1146/annurev.micro.54.1.341.
- GenProtEC database. [http://genprotec.mbl.edu/]
- Rudd KE: EcoGene: a genome sequence database for Escherichia coli K-12. Nucleic Acids Res. 2000, 28: 60-64. 10.1093/nar/28.1.60.
- Karp PD, Riley M, Saier M, Paulsen IT, Paley SM, Pellegrini-Toole A: The EcoCyc and MetaCyc databases. Nucleic Acids Res. 2000, 28: 56-59. 10.1093/nar/28.1.56.
- Thomas GH: Completing the E. coli proteome: a database of gene products characterised since the completion of the genome sequence. Bioinformatics. 1999, 15: 860-861. 10.1093/bioinformatics/15.10.860.
- CGSC: E.coli Genetic Stock Center. [http://cgsc.biology.yale.edu/]
- E. coli genome project University of Wisconsin-Madison. [http://www.genome.wisc.edu/]
- Gaasterland T, Sensen CW: Fully automated genome analysis that reflects user needs and preferences. A detailed introduction to the MAGPIE system architecture. Biochimie. 1996, 78: 302-310. 10.1016/0300-9084(96)84761-4.
- MAGPIE automated genome project investigation environment. [http://genomes.rockefeller.edu/magpie/ecoli/]
- Wong RS, McMurry LM, Levy SB: 'Intergenic' blr gene in Escherichia coli encodes a 41-residue membrane protein affecting intrinsic susceptibility to certain inhibitors of peptidoglycan synthesis. Mol Microbiol. 2000, 37: 364-370. 10.1046/j.1365-2958.2000.01998.x.
- Serres MH, Riley M: Genomics and metabolism in Escherichia coli. In The Prokaryotes: An Evolving Electronic Database for the Microbiological Community. Edited by Dworkin M, et al. New York: Springer-Verlag. 2000, [http://www.prokaryotes.com]
- Perez-Rueda E, Collado-Vides J: The repertoire of DNA-binding transcriptional regulators in Escherichia coli K-12. Nucleic Acids Res. 2000, 28: 1838-1847. 10.1093/nar/28.8.1838.
- RegulonDB. [http://www.cifn.unam.mx/regulondb/]
- Saier MH: A functional-phylogenetic classification system for transmembrane solute transporters. Microbiol Mol Biol Rev. 2000, 64: 354-411. 10.1128/MMBR.64.2.354-411.2000.
- Genomic Comparisons of Membrane Transport Systems. [http://www.biology.ucsd.edu/~ipaulsen/transport/]
- Riley M: Genes and proteins of Escherichia coli K-12. Nucleic Acids Res. 1998, 26: 54. 10.1093/nar/26.1.54.
- Jackowski S, Jackson PD, Rock CO: Sequence and function of the aas gene in Escherichia coli. J Biol Chem. 1994, 269: 2921-2928.
- Mengin-Lecreulx D, van Heijenoort J: Copurification of glucosamine-1-phosphate acetyltransferase and N-acetylglucosamine-1-phosphate uridyltransferase activities of Escherichia coli: characterization of the glmU gene product as a bifunctional enzyme catalyzing two subsequent steps in the pathway for UDP-N-acetylglucosamine synthesis. J Bacteriol. 1994, 176: 5788-5795.
- Riley M, Labedan B: Protein evolution viewed through Escherichia coli protein sequences: introducing the notion of a structural segment of homology, the module. J Mol Biol. 1997, 268: 857-868. 10.1006/jmbi.1997.1003.
- Vickers LP, Ackers GK, Ogilvie JW: Aspartokinase I-homoserine dehydrogenase I of Escherichia coli K12. Concentration-dependent dissociation to dimers in the presence of L-threonine. J Biol Chem. 1978, 253: 2155-2160.
- Truffa-Bachi P, Van Rapenbusch R, Gros C, Cohen GN, Janin J: The threonine-sensitive homoserine dehydrogenase and aspartokinase activities of Escherichia coli K-12. Subunit structure of the protein catalyzing the two activities. Eur J Biochem. 1969, 7: 401-407.
- Riley M: Functions of the gene products of Escherichia coli. Microbiol Rev. 1993, 57: 862-952.
- Rudd KE: Novel intergenic repeats of Escherichia coli K-12. Res Microbiol. 1999, 150: 653-664. 10.1016/S0923-2508(99)00126-6.
- Bachellier S, Clement JM, Hofnung M: Short palindromic repetitive DNA elements in enterobacteria: a survey. Res Microbiol. 1999, 150: 627-639. 10.1016/S0923-2508(99)00128-X.
- Serres MH, Riley M: MultiFun, a multifunctional classification scheme for Escherichia coli K-12 gene products. Microb Comp Genomics. 2000, 5: 205-222.
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-29. 10.1038/75556.
- Delcher AL, Harmon D, Kasif S, White O, Salzberg SL: Improved microbial gene identification with GLIMMER. Nucleic Acids Res. 1999, 27: 4636-4641. 10.1093/nar/27.23.4636.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
- Bairoch A: PROSITE: a dictionary of sites and patterns in proteins. Nucleic Acids Res. 1992, Suppl 20: 2013-2018.
- Gonnet GH, Cohen MA, Benner SA: Exhaustive matching of the entire protein sequence database. Science. 1992, 256: 1443-1445.