- Web report
- Open Access
Phylogenetic classification of proteins encoded in complete genomes
- Todd Richmond
© BioMed Central Ltd 2000
- Received: 29 February 2000
- Published: 27 April 2000
For those interested in genome-wide or more restricted comparisons of proteins across species, the Clusters of Orthologous Groups (COGs) website provides the kind of information many of us want.
- Haemophilus influenzae
- Chlamydia trachomatis
- Orthologous Group
- Paralogous Protein
- Genome Online Database
For those interested in genome-wide or more restricted comparisons of proteins across species, the Clusters of Orthologous Groups (COGs) website provides the kind of information many of us want but for which we lack the necessary computational tools or computing power. For 21 completely sequenced organisms, including Escherichia coli, Saccharomyces cerevisiae and Haemophilus influenzae, the site gives all the clusters of orthologous proteins (or orthologous sets of paralogs) shared by three or more lineages. Currently, 2,112 COGs are listed. The database can be searched in many different ways: for proteins shared among particular organisms, for proteins found in some organisms but not others, for proteins found in particular metabolic pathways, or by functional classification. The site provides more information than a simple BLAST search for sequence similarity can, especially for bacterial proteins. Bacterial genes have often been sequenced many times, so it is difficult to tell from a simple search whether a particular bacterium has one gene sequenced five times or five genes each sequenced once. The COGs site removes some of that ambiguity by using only completely sequenced and annotated genomes.
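To make the "found in some organisms but not others" style of search concrete, here is a minimal Python sketch of a presence/absence query. It is not the site's actual implementation: the cluster names and their genome memberships are invented for illustration, the function `select_clusters` is hypothetical, and each genome is treated as its own lineage for simplicity. The one-letter genome codes are the four named in this report.

```python
from typing import Dict, FrozenSet, Iterable, List

# Hypothetical presence/absence data: cluster name -> genomes with a member.
# Genome codes as named in this report: E = E. coli, Y = S. cerevisiae,
# R = M. tuberculosis, I = C. trachomatis. Cluster names are invented.
CLUSTERS: Dict[str, FrozenSet[str]] = {
    "cluster_A": frozenset("EYRI"),
    "cluster_B": frozenset("ERI"),
    "cluster_C": frozenset("EY"),   # spans only two genomes
    "cluster_D": frozenset("YRI"),
}


def select_clusters(
    clusters: Dict[str, FrozenSet[str]],
    present: Iterable[str] = (),
    absent: Iterable[str] = (),
    min_lineages: int = 3,
) -> List[str]:
    """Return sorted names of clusters that occur in every genome in
    `present`, in no genome in `absent`, and in at least `min_lineages`
    genomes (assumption: one genome counts as one lineage)."""
    required = set(present)
    excluded = set(absent)
    hits = []
    for name, genomes in clusters.items():
        if len(genomes) < min_lineages:
            continue  # too narrowly distributed to qualify
        if required <= genomes and not (excluded & genomes):
            hits.append(name)
    return sorted(hits)


if __name__ == "__main__":
    # Clusters shared by E. coli and yeast.
    print(select_clusters(CLUSTERS, present="EY"))
    # Clusters in M. tuberculosis and C. trachomatis but missing from yeast.
    print(select_clusters(CLUSTERS, present="RI", absent="Y"))
```

Storing each cluster as a set of genome codes reduces both kinds of query to a subset test and an intersection test, which is one simple way to think about what the site's organism filters are doing.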
The site was last updated on 24 January 2000.
There is a comprehensive help page, which is essential for getting the most information out of the site.
The information tends to be arranged in huge tables that take a long time to load. For example, browsing the list of COGs requires first loading a 431K table. When browsing through individual COGs, this table has to be reloaded each time. Ideally, the links should open in a new window so that the table never has to be reloaded. The colors and abbreviations used throughout the site can be cryptic. Genomes are designated by one-letter codes, some of which are not obvious: E for E. coli and Y for S. cerevisiae (yeast) are fine, but R for Mycobacterium tuberculosis and I for Chlamydia trachomatis? Why not use a more intuitive two-letter code? As for the colors, whoever designed the site must have had good color vision, but subtle shades are not the best way of conveying information to everyone.
Although there are advantages in restricting COGs to completed genomes, it would be nice if a few partially sequenced higher eukaryotes were included. Sometimes one wants to ask "Is this protein found in both eukaryotes and prokaryotes, and how divergent is it?"
The COGs site is unique. Although other sites collect information about protein domains or specific protein families, none of them allows browsing through the collected proteins from 21 complete genomes. It is debatable, however, how long this site will be able to maintain its present form. According to the Genomes OnLine Database, there are currently 107 prokaryotic and 31 eukaryotic genomes being sequenced, and in a few years the site will become too cumbersome in its present format.