Myriads of protein families, and still counting
© BioMed Central Ltd 2003
Published: 28 January 2003
From the historical record of genome sequencing, we show that the rate of discovery of new families has remained constant over time, indicating that our knowledge of sequence space is far from complete.
With the advent of genome projects, the number of known protein sequences has increased exponentially. We have analyzed the historical record of the discovery of 56,667 protein families encompassing 311,256 proteins from 83 complete genomes (available as of 28 May 2002). Our findings show that the rate of discovery of new families has remained constant over time, indicating that our knowledge of sequence space is far from complete.
A decade ago, it was proposed that there might be a limited number of protein families and folds [1]. Ever since, the expectation has been that the discovery of new proteins will eventually slow down as genome sequencing samples protein space more thoroughly [2,3]. With a multitude of complete genomes now available, it is possible to test this expectation by examining the rate of protein family discovery.
To achieve this, we have clustered all protein sequences from all 83 complete genomes, using the TRIBE-MCL algorithm [4]. The resulting clusters represent sequence families with common functional properties and are tighter than structure-defined families or folds. For each family, we recorded the earliest sequenced genome in which it appears (the 'founder' genome). We then counted the number of new families that each genome sequence contributed at the moment of its release.
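The founder-genome bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration only, not the code used in the study: the function name `count_new_families`, the input layout (a release-ordered list of genomes and a mapping from family to the genomes that contain it) and the toy data are all assumptions made for the example, and the clustering step itself is taken as already done.

```python
from collections import Counter


def count_new_families(genome_order, family_members):
    """Assign each family a founder genome and tally new families per genome.

    genome_order   -- genome names, earliest release first (assumed input)
    family_members -- dict: family id -> set of genomes containing the family
    """
    rank = {genome: i for i, genome in enumerate(genome_order)}
    founders = {}
    for family, genomes in family_members.items():
        # The founder is the earliest-released genome that contains the family.
        founders[family] = min(genomes, key=rank.__getitem__)
    tally = Counter(founders.values())
    # Report genomes in release order, including those that add no new family.
    new_per_genome = [(genome, tally.get(genome, 0)) for genome in genome_order]
    return founders, new_per_genome


# Toy, made-up data: three genomes and four families.
genome_order = ["genome_A", "genome_B", "genome_C"]
family_members = {
    "fam1": {"genome_A", "genome_C"},
    "fam2": {"genome_B"},
    "fam3": {"genome_B", "genome_C"},
    "fam4": {"genome_C"},
}

founders, new_per_genome = count_new_families(genome_order, family_members)
print(new_per_genome)
```

In this toy case the second genome founds two families and the other two found one each; a genome that only contains families already seen in earlier releases would be reported with a count of zero.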
According to our data, the rate of protein family discovery continues to be constant over time (correlation coefficient with the genome sequencing order is R² > 0.98). Although the major leaps have been produced by eukaryotic genomes, which contributed a third of new protein families, diversity cannot be attributed to eukaryotes alone. When only the Bacteria and the Archaea are considered, the constant rate of discovery of novel families is even more pronounced (correlation coefficient R² > 0.99), suggesting that the exploration of prokaryotic diversity using genome sequencing is far from reaching completion.
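One way to read the constancy claim is as a straight-line fit of the cumulative number of families against the order in which genomes were released. The sketch below computes R² for such a fit in plain Python. It assumes that this is the quantity being correlated, which the text does not spell out, and the per-genome counts are invented purely for illustration.

```python
from itertools import accumulate


def r_squared(xs, ys):
    """Squared Pearson correlation, i.e. R² of a simple linear fit of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


# Made-up numbers of new families contributed by five successive genomes.
new_families = [900, 1100, 1000, 950, 1050]

# Cumulative family count after each release, and the release order 1..n.
cumulative = list(accumulate(new_families))
order = list(range(1, len(cumulative) + 1))

r2 = r_squared(order, cumulative)
print(f"R^2 = {r2:.4f}")
```

A roughly constant contribution per genome gives a cumulative curve that is close to a straight line and hence an R² near 1, whereas a discovery rate that was tailing off would bend the curve and lower the value.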
What are the major contributors of novel protein families? When contributions are normalized by the number of families per genome, the phylogenetic position of the corresponding organism proves crucial. The leading contributions come from Haemophilus influenzae, Saccharomyces cerevisiae and Caenorhabditis elegans, representing the first bacterial, eukaryotic and metazoan genomes sequenced, respectively. In contrast, the smallest contributions come from additional strains of already sequenced species.
Although there are important reasons why a genome might be sequenced other than just to cover protein-sequence space, the contribution of new protein families can also be normalized by genome size, which roughly represents the corresponding sequencing effort. From this viewpoint, the human genome has added only 1.3 new families per megabase, although this may be an underestimate given the uncertainty surrounding the total number of genes [8]. This number compares unfavorably with an average of 172 new families per megabase over all organisms, or with species such as Xylella fastidiosa and Borrelia burgdorferi, which contributed 380 new families per megabase each.
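The normalization by genome size is a simple ratio of new families to megabases sequenced. The sketch below shows the arithmetic; the function name `families_per_megabase` and the two example genomes, with their family counts and sizes, are hypothetical and are not taken from the data set analyzed here.

```python
def families_per_megabase(new_families, genome_size_bp):
    """Return new families contributed per megabase (10^6 bp) of genome."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be a positive number of base pairs")
    return new_families / (genome_size_bp / 1_000_000)


# Hypothetical genomes: (new families contributed, genome size in base pairs).
examples = {
    "small_genome": (500, 2_000_000),
    "large_genome": (3000, 3_000_000_000),
}

rates = {
    name: families_per_megabase(count, size)
    for name, (count, size) in examples.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} new families per megabase")
```

The example shows how a large genome can contribute more new families in absolute terms and still score far lower per megabase than a compact one.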
In conclusion, the constant growth rate for new protein families suggests that protein-sequence space remains largely unexplored. Sampling biological diversity through genome sequencing will continue to produce large numbers of novel protein families with interesting biochemical properties.
1. Chothia C: One thousand families for the molecular biologist. Nature. 1992, 357: 543-544. 10.1038/357543a0.
2. Vitkup D, Melamud E, Moult J, Sander C: Completeness in structural genomics. Nat Struct Biol. 2001, 8: 559-566. 10.1038/88640.
3. Fischer D, Eisenberg D: Finding families for genomic ORFans. Bioinformatics. 1999, 15: 759-762. 10.1093/bioinformatics/15.9.759.
4. Enright AJ, Van Dongen S, Ouzounis CA: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002, 30: 1575-1584. 10.1093/nar/30.7.1575.
5. Iliopoulos I, Tsoka S, Andrade MA, Janssen P, Audit B, Tramontano A, Valencia A, Leroy C, Sander C, Ouzounis CA: Genome sequences and great expectations. Genome Biol. 2001, 2: interactions0001.1-0001.3.
6. Boucher Y, Nesbø CL, Doolittle WF: Microbial genomes: dealing with diversity. Curr Opin Microbiol. 2001, 4: 285-299. 10.1016/S1369-5274(00)00204-6.
7. Doolittle RF: Microbial genomes multiply. Nature. 2002, 416: 697-700. 10.1038/416697a.
8. Daly MJ: Estimating the human gene count. Cell. 2002, 109: 283-284.
9. The European Bioinformatics Institute Computational Genomics Group. [http://www.ebi.ac.uk/research/cgg/seqspace]