Comparative genomics of gene-family size in closely related bacteria
© Pushker et al. 2004
Received: 12 December 2003
Accepted: 6 February 2004
Published: 18 March 2004
The wealth of genomic data in bacteria is helping microbiologists understand the factors involved in gene innovation. Among these, the expansion and reduction of gene families appear to play a fundamental role, but the factors influencing gene family size are unclear.
The relative content of paralogous genes in bacterial genomes increases with genome size, largely due to the expansion of gene family size in large genomes. Bacteria undergoing genome reduction display a parallel process of redundancy elimination, by which gene families are reduced to one or a few members. Gene family size is also influenced by sequence divergence and physiological function. Large gene families show wider sequence divergence, suggesting they are probably older, and certain functions (such as metabolite transport mechanisms) are overrepresented in large families. The size of a given gene family is remarkably similar in strains of the same species and in closely related species, suggesting that homologous gene families are vertically transmitted and depend little on horizontal gene transfer (HGT).
The remarkable preservation of copy numbers in widely different ecotypes indicates a functional role for the different copies rather than simply a back-up role. When different genera are compared, the increase in phylogenetic distance and/or ecological specialization disrupts this preservation, albeit in a gradual manner and maintaining an overall similarity, which also supports this view. HGT can have an important role, however, in nonhomologous gene families, as exemplified by a comparison between saprophytic and enterohemorrhagic strains of Escherichia coli.
One of the unexpected revelations of prokaryotic genomes has been the existence of significant gene redundancy. The existence of multiple gene copies in eukaryotes has been known for a long time and is considered an important element in their molecular evolution [1, 2]. In pre-genomic times, however, bacteria were considered to be streamlined cells that carried very little, if any, redundant information in their genomes. It therefore came as a surprise when the genome of Escherichia coli K12 showed that nearly 30% of the coding sequences could be grouped into gene families that were similar enough to be assigned similar functions [3, 4]. They were described as 'paralog' gene families, with the implicit assumption that their similarity reflected similar evolutionary descent, but actual or potential functional divergence. Since then, the presence of gene families typically containing between two and 30 copies has been described for nearly every prokaryotic genome sequenced. The number of paralogous genes and families appears to correlate well with an increase in genome size [5, 6]. The relative contribution of these genes in each genome seems to be independent of phylogenetic affiliation and, for a limited dataset, appears to depend on genome size .
These gene families of diverse size and degree of similarity remain an important and little explored feature of prokaryotes. In eukaryotic genomes they are generally taken as the result of gene duplication. This would either supply the required gene dosage or the raw material for adaptation by mutation and selection acting on one of the copies that diverges in properties or function [1, 8]. In E. coli, a model organism in which traditional genetics and physiology have already allowed the unequivocal identification of more than half of the coding genes, the role of paralog families (whatever their origin) seems much more operational than in eukaryotes . For example, the different members of a gene family contribute the proper gene dosage or, most often, provide different specificities for similar chemical reactions or for other processes such as transport of different molecules. Regarding origin, duplication is not necessarily the only source for new members of a gene family in prokaryotes. The gene pools are known to vary enormously from one strain to another [9, 10], and horizontal gene transfer (HGT) acts as a powerful source of innovation . Therefore, HGT could provide gene families with members already divergent in sequence and function . In prokaryotes, gene families could be the result of incomplete xenologous gene replacement by which a gene from another genome gets incorporated into a gene family with which it shares some sequence similarity. This process would provide additional physiological plasticity, and studies on the DNA composition of paralogous genes suggest that its contribution might be substantial . The divergence of some of the members of the gene families or their DNA composition could be taken as evidence for a HGT origin . It is unclear at the moment the extent to which each of these genomic forces (gene duplication and HGT) contributes to genome expansion and variability [5, 14–16].
To address these issues we have compared the size of gene families across bacterial taxa. To shed light on the evolutionary origin of these initially redundant genes, we have studied the distribution of gene family size among completed genomes of strains within the same bacterial species and over larger taxonomic distances. If the different family members were acquired by HGT, their numbers should vary widely among different strains, as already detected for single genes in adaptive islands or for entire families predicted to have been transferred as a unit. On the other hand, if the family numbers are similar in different strains, vertical descent or a very old HGT event is a more likely origin. We have also determined the contribution of paralogous families to genome size for all 127 available eubacterial genomes, updating earlier work on a more limited dataset. Finally, we have tried to identify other factors, besides genome size, that affect the number of members in a family, particularly sequence divergence, gene function and species lifestyle.
Exceptions to the linear correlation in this graph are interesting to consider. On one hand, Pirellula (marked as Pir in Figure 1) has an enormous genome with a surprisingly low relative number of paralogs. This is due to an overrepresentation of small gene families and the absence of large ones (the largest gene family contains 57 members; see Additional data file 1). Pirellula is a marine bacterium and the reason for the reduced gene family size might be the homogeneity of the marine environment, in contrast to other large-genomed bacteria included in the graph which have the ability to survive in many different niches or in much more heterogeneous habitats, such as soil. In agreement with this, Pirellula has a greatly reduced number of transcriptional regulators, which again might reflect a relatively constant environment . At the other end of the distribution, exceptions occur for three species that have small genomes with a larger-than-expected percentage of paralogs. All these species are mycoplasmas, and the high percentage of paralogs is due to a few gene families that are greatly expanded, including more than 25 members. In Mycoplasma penetrans, for example, these families include surface-exposed lipoproteins involved in antigenic variation , which are critical to the success of microbes exposed to the immune system of their hosts. On the other hand, the small genomes of other pathogenic bacteria correspond to intracellular parasites that do not need to evade the immune system , and these species show the smallest portion of paralogs. Finally, the largest gene families that we detected were those involving mobile genetic elements such as the IS elements of Shigella flexneri, where families surpassed 100 members (not included in the inset of Figure 1).
Another feature we could detect in the evolution of gene families was that large families were more divergent (Figure 3). This could partly be due to a side-effect of the higher variability of a larger sample size or to misidentification of family members at low sequence identity levels. However, given the observed similarity of functions in these large families ([4, 28] and R.P., A.M. and F.R-V., unpublished results), a substantial proportion must be true paralogous genes. Thus, this relationship can be interpreted as older (more divergent) families containing more members. Smaller families range from those with very similar members to those in which the members are very different. The latter probably represent either old families in which new members have not evolved because new duplications do not confer a selective advantage, or more recent incomplete xenologous replacements.
The sequencing of several strains of a single species is now common in bacterial genomics. One of the most remarkable findings has been the different gene pools carried by strains that are highly similar if only their housekeeping genes are compared. For example, different virotypes of E. coli were shown to contain very different gene complements, with large pools of genes characteristic of each virotype. Multigene families would be obvious candidates for such variation. Thus, comparing the numbers of family members within a single species might shed light on their origin. If the members of a gene family are frequently acquired by HGT from outside, the numbers would be expected to vary broadly in different lineages of the species (as a result of different acquisitions). On the other hand, if the numbers are similar, that would indicate that the families were already present in the common ancestor and represent a relatively stable feature of the genome.
We have obviously not excluded the possibility that nonhomologous gene families add to the differences among the compared genomes. For example, in a pairwise comparison between E. coli K12 and E. coli O157:H7, 186 genes belonging to paralog families were unique to K12 and 788 to O157:H7, versus 403 singletons (single-copy genes not belonging to families) unique to K12 and 883 to O157:H7. Thus, K12 keeps the same standard proportion of 30% paralogs for the differential gene pool. In O157:H7, on the other hand, paralogs account for 47% of the set of unique genes. The interpretation might be that the large islands that characterize the genome of the enterohemorrhagic virotype tend to carry a bigger proportion of families than the rest of the genome. Thus, it is possible that in some strains, HGT may contribute to expand and generate gene families that do not appear as homologs in closely related genomes. For example, 146 genes belonging to families of 10 or more members were detected in the O157:H7 differential pool, including three whole families of 14, 17 and 20 members with a G+C content of 57, 54 and 53%, respectively (the average G+C content in E. coli O157:H7 is 50.6%). The largest differential family in K12 had 11 members, which were not present in the enterohemorrhagic strain, and had a G+C content of 54.1% (the average G+C content of E. coli K12 is 50.5%).
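The percentages quoted above follow directly from the gene counts given in the text. The short sketch below only reproduces that arithmetic; the function name `paralog_fraction` is introduced here for illustration and is not taken from the original analysis.

```python
def paralog_fraction(paralogs: int, singletons: int) -> float:
    """Fraction of a strain-specific gene pool that belongs to paralog families."""
    return paralogs / (paralogs + singletons)


# Counts of strain-specific genes as stated in the text.
k12 = paralog_fraction(paralogs=186, singletons=403)
o157 = paralog_fraction(paralogs=788, singletons=883)

print(f"K12 differential pool:     {k12:.1%}")   # 31.6%
print(f"O157:H7 differential pool: {o157:.1%}")  # 47.2%
```

The K12 value is close to the genome-wide proportion of roughly 30%, whereas the O157:H7 value rounds to the 47% reported above.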
Figure 5c shows the difference in gene family size in the interspecific comparison of E. coli K12 and S. typhimurium LT2. Both strains have similarly sized genomes (S. typhimurium is 218 kb larger) and a relatively high level of homology (3,026 orthologous genes). Of these, there are 572 homologs belonging to families that differ in size between the two genomes, and 435 belonging to families having the same number of members in both species. The rest are single-copy genes in both genomes. Forty-eight families were significantly larger (two or more extra copies) in E. coli, while 53 were larger in Salmonella. These differences can be taken as an example of the evolution of gene families in two diverging groups. Although the natural history of these model bacteria is not as well known as might be expected, it is generally believed that both Salmonella and Escherichia are mostly saprophytic facultative anaerobes that inhabit the intestine of vertebrates. The divergence between these two microbes arose after the origin of mammals around 120 million years ago. E. coli specialized as a commensal and an opportunistic pathogen of mammals, as witnessed, for example, by its ability to degrade lactose. On the other hand, Salmonella remains as a commensal in reptiles, with some serotypes colonizing mammals, but as a pathogen rather than a commensal and after developing strategies for intracellular invasion of the host [30, 31]. Accepting this scenario, the fact that many gene families (and the number of members of each family) are preserved reflects a significant involvement in the saprophytic intestinal lifestyle, preserved over many millions of years. On the other hand, significant differences are starting to arise between the two species, perhaps reflecting their specialization in different hosts and lifestyles . A dramatic example is the potG gene family, which has 13 more members in S. typhimurium than in E. coli (Figure 5). 
This gene encodes an ATP-binding component of spermidine/putrescine transport and, for reasons that remain unclear, its amplification has been selected in this species. Proteins involved in the transport of spermidine and putrescine have been shown to be involved in attachment to host cells and virulence. Therefore, the size of this gene family might reflect the more pathogenic lifestyle of Salmonella.
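The classification used in this comparison (families of equal size, and families with two or more extra copies in one genome) can be sketched as follows. This is a minimal illustration rather than the authors' procedure: the function name, the dictionary layout and the family names and copy numbers in the example are all hypothetical.

```python
def compare_family_sizes(sizes_a, sizes_b, min_extra=2):
    """Classify shared gene families by their copy-number difference.

    sizes_a / sizes_b map a family identifier to its number of members
    in genome A and genome B. Only families present in both are compared.
    """
    result = {"larger_in_a": [], "larger_in_b": [], "same": [], "minor_difference": []}
    for family in sorted(sizes_a.keys() & sizes_b.keys()):
        diff = sizes_a[family] - sizes_b[family]
        if diff >= min_extra:
            result["larger_in_a"].append(family)       # two or more extra copies in A
        elif -diff >= min_extra:
            result["larger_in_b"].append(family)       # two or more extra copies in B
        elif diff == 0:
            result["same"].append(family)
        else:
            result["minor_difference"].append(family)  # differs by a single copy
    return result


# Hypothetical copy numbers, for illustration only.
genome_a = {"famA": 5, "famB": 2, "famC": 3, "famD": 4, "famE": 1}
genome_b = {"famA": 2, "famB": 6, "famC": 3, "famD": 3, "famF": 2}
classes = compare_family_sizes(genome_a, genome_b)
```

Families found in only one genome (famE and famF in the example) are left out, mirroring the restriction of the comparison to homologous families.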
In eukaryotic genomes, a cornerstone of gene creation is extension of paralogous families by gene duplication. This is reflected in the slow increase of new gene families with genome size, which does correlate with an increase in the size of the families. The importance of DNA duplication in eukaryotes is probably also favored by the limitations of HGT in this group. Despite the pervasiveness of HGT in prokaryotes, the increase in gene families with genome size is also robust (Figure 1). One obvious factor contributing to this situation might be that the pool of essential genes required for basic cell biology represents a larger percentage of a smaller genome, leaving less room for redundant genes with related functions, which are more expendable. However, this does not explain the high level of correlation maintained at the larger end of the range.
Of course, with the number of genomes presently available there is a certain representation bias, with a large input from human pathogens. Among these, small genomes often correspond to intracellular forms that are protected from the immune system of the host. Variability of antigen specificity is one paradigmatic case that justifies gene families in extracellular pathogens of vertebrates, for example the PPE genes of Mycobacterium and the Pap adhesins in E. coli. The exceptional case of the mycoplasmas points in this direction, as they possess small genomes but are extracellular mucosa-associated pathogens, and hence subjected to the host immune system. At the other end of the genome size range there are many more free-living, saprophytic or opportunistic pathogens, a lifestyle that requires a highly versatile gene complement in order to survive, for example, both inside and outside a host. Again, the one exception is a single large-genome species from a relatively stable environment (Pirellula, which lives in the open ocean). Here, the ability to carry out many different physiological activities is probably more advantageous than the ability to adapt the same activity to a wider range of conditions. Thus, as with other aspects of biology, the genomic properties of bacteria appear to be greatly conditioned by their specialist or generalist lifestyle.
The comparison of gene family size among strains from a single species shows a remarkable level of conservation, even when genome sizes are very different. This conservation indicates that gene family size is probably an ancestral feature rather than reflecting the acquisition of paralogs by HGT. This is consistent with evolutionary models based on bacterial gene content, which concluded that most protein gene families are transmitted by vertical inheritance. The conservation detected even among more distantly related taxa strengthens this view: a remarkable degree of conservation is found even in mostly free-living and very niche-diversified species, such as those of Pseudomonas. This might reflect involvement of the gene families in more fundamental (less environment-dependent) processes of cell biology.
Genomic evolution simulations concluded that the amount of gene duplication is independent of HGT levels . On the basis of these simulations, an upper limit of 20% was estimated for paralogs of xenologous origin. Assuming that the extra members of a gene family from our paralog plots represent an upper limit of HGT for established families, we calculate that gene transfer accounts for a maximum of 11% of a given family in E. coli (Figure 4c). However, this does not take into account families that are unique to a given strain and that may have a xenologous origin. The fact that these families are not included in the paralog plots (which display only homolog pairs between strains) suggests that they can represent transfers to a given strain. Thus, the paralog plots present a picture of stability and limited xenologous genes for already established families, but this is not inconsistent with the transfer of families that appear to be unique to a given strain or species. It could, theoretically, be more probable that gene families expand by horizontal transfers than by gene duplication . This way, xenologous genes would already confer a functionally distinct role and would avoid the neutrality period in which redundant gene copies coexist and can be eliminated . The results shown here suggest that the overrepresentation of duplications among transferred genes found by Hooper and Berg  might be a feature of these specific families but not of more ancient, homologous ones.
Species used in the current work and their accession numbers
Genome size (bp)
Agrobacterium tumefaciens str. C58 (Cereon)
Agrobacterium tumefaciens str. C58 (U. Washington)
Aquifex aeolicus VF5
Bacillus anthracis str. Ames
Bacillus cereus ATCC 14579
Bacillus subtilis subsp. subtilis str. 168
Bacteroides thetaiotaomicron VPI-5482
Bifidobacterium longum NCC2705
Borrelia burgdorferi B31
Bradyrhizobium japonicum USDA 110
Brucella melitensis 16M
Brucella suis 1330
Buchnera aphidicola str. APS (Acyrthosiphon pisum)
Buchnera aphidicola str. Bp (Baizongia pistaciae)
Buchnera aphidicola str. Sg (Schizaphis graminum)
Campylobacter jejuni subsp. jejuni NCTC 11168
Candidatus Blochmannia floridanus
Caulobacter crescentus CB15
Chlamydophila caviae GPIC
Chlamydophila pneumoniae AR39
Chlamydophila pneumoniae CWL029
Chlamydophila pneumoniae J138
Chlamydophila pneumoniae TW-183
Chlorobium tepidum TLS
Chromobacterium violaceum ATCC 12472
Clostridium perfringens str. 13
Clostridium tetani E88
Corynebacterium efficiens YS-314
Corynebacterium glutamicum ATCC 13032
Coxiella burnetii RSA 493
Enterococcus faecalis V583
Escherichia coli CFT073
Escherichia coli K12
Escherichia coli O157:H7
Escherichia coli O157:H7 EDL933
Fusobacterium nucleatum subsp. nucleatum ATCC 25586
Haemophilus ducreyi 35000HP
Haemophilus influenzae Rd
Helicobacter hepaticus ATCC 51449
Helicobacter pylori 26695
Helicobacter pylori J99
Lactobacillus plantarum WCFS1
Lactococcus lactis subsp. lactis
Leptospira interrogans serovar lai str. 56601
Listeria monocytogenes EGD-e
Mycobacterium bovis subsp. bovis AF2122/97
Mycobacterium tuberculosis CDC1551
Mycobacterium tuberculosis H37Rv
Mycoplasma gallisepticum R
Neisseria meningitidis MC58
Neisseria meningitidis Z2491
Nitrosomonas europaea ATCC 19718
Nostoc sp. PCC 7120
Oceanobacillus iheyensis HTE831
Photorhabdus luminescens subsp. laumondii TTO1
Porphyromonas gingivalis W83
Prochlorococcus marinus str. MIT 9313
Prochlorococcus marinus subsp. marinus str. CCMP1375
Prochlorococcus marinus subsp. pastoris str. CCMP1378
Pseudomonas aeruginosa PAO1
Pseudomonas putida KT2440
Pseudomonas syringae pv. tomato str. DC3000
Salmonella enterica subsp. enterica serovar Typhi
Salmonella enterica subsp. enterica serovar Typhi Ty2
Salmonella typhimurium LT2
Shewanella oneidensis MR-1
Shigella flexneri 2a str. 2457T
Shigella flexneri 2a str. 301
Staphylococcus aureus subsp. aureus MW2
Staphylococcus aureus subsp. aureus Mu50
Staphylococcus aureus subsp. aureus N315
Staphylococcus epidermidis ATCC 12228
Streptococcus agalactiae 2603V/R
Streptococcus agalactiae NEM316
Streptococcus mutans UA159
Streptococcus pneumoniae R6
Streptococcus pneumoniae TIGR4
Streptococcus pyogenes M1 GAS
Streptococcus pyogenes MGAS315
Streptococcus pyogenes MGAS8232
Streptococcus pyogenes SSI-1
Streptomyces avermitilis MA-4680
Streptomyces coelicolor A3(2)
Synechococcus sp. WH 8102
Synechocystis sp. PCC 6803
Thermosynechococcus elongatus BP-1
Tropheryma whipplei TW08/27
Tropheryma whipplei str. Twist
Vibrio parahaemolyticus RIMD 2210633
Vibrio vulnificus CMCP6
Vibrio vulnificus YJ016
Wigglesworthia glossinidia (from Glossina brevipalpis)
Xanthomonas axonopodis pv. citri str. 306
Xanthomonas campestris pv. campestris str. ATCC 33913
Xylella fastidiosa 9a5c
Xylella fastidiosa Temecula1
Yersinia pestis CO92
Yersinia pestis KIM
To detect potential paralogous genes, we carried out an all-against-all BLASTP search of every protein sequence in a genome against every protein sequence in the same genome. We define paralogs as protein sequences satisfying an E-value threshold of 10^-5 in a BLASTP search and having at least 30% sequence identity over more than 60% of their lengths.
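The paralog criteria above can be expressed as a simple filter over pairwise hits. The sketch below is illustrative and is not the authors' pipeline: the `Hit` record and `is_paralog_hit` function are hypothetical names, the values are assumed to have been parsed beforehand from BLASTP output, and it assumes that "more than 60% of their lengths" means the alignment must cover more than 60% of both the query and the subject.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One pairwise BLASTP hit (hypothetical record, fields parsed elsewhere)."""
    query: str
    subject: str
    evalue: float
    pct_identity: float  # percent identity over the aligned region
    aln_len: int         # alignment length in residues
    query_len: int
    subject_len: int


def is_paralog_hit(hit, max_evalue=1e-5, min_identity=30.0, min_coverage=0.60):
    """Return True if the hit satisfies the paralog thresholds stated in the text."""
    if hit.query == hit.subject:           # ignore self-hits from the all-against-all search
        return False
    if hit.evalue > max_evalue:            # E-value threshold of 10^-5
        return False
    if hit.pct_identity < min_identity:    # at least 30% identity
        return False
    # Assumption: coverage must exceed 60% of BOTH sequences.
    query_cov = hit.aln_len / hit.query_len
    subject_cov = hit.aln_len / hit.subject_len
    return query_cov > min_coverage and subject_cov > min_coverage


# Made-up hits, for illustration only.
hits = [
    Hit("a", "b", 1e-20, 45.0, 250, 300, 320),   # passes all thresholds
    Hit("a", "c", 1e-3, 50.0, 250, 300, 300),    # E-value too high
    Hit("a", "d", 1e-10, 25.0, 250, 300, 300),   # identity too low
    Hit("a", "e", 1e-10, 40.0, 150, 300, 200),   # covers only 50% of the query
    Hit("a", "a", 0.0, 100.0, 300, 300, 300),    # self-hit
]
paralog_pairs = [(h.query, h.subject) for h in hits if is_paralog_hit(h)]
```

Requiring coverage of both sequences is the stricter reading of the length criterion, and it is the one that helps keep multidomain proteins from being matched through a single shared domain.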
When comparing paralogs between two species, a gene family was created for each homologous gene detected in both genomes. This gave rise to some redundant families but ensured that the comparison between species was done between equivalent gene families. To describe the functional assignment of paralogous genes, extended gene families were created that contained all genes that were interrelated by hits among any of their members. This is based on the transitive nature of sequence homology and is supported by the findings on well-studied genomes of species with a relatively well-known metabolism. In these cases, extended gene families seem to be formed by genes involved in similar functions. To minimize the incorporation of multidomain proteins in a family together with unrelated members, length cut-offs were kept at 60%. The assignment of a function to a gene was based on the Clusters of Orthologous Groups (COGs) classification.
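Grouping genes that are interrelated by hits among any of their members amounts to finding connected components in a graph whose edges are the accepted pairwise hits. The following is a minimal sketch of that idea using a union-find structure; the function name `extended_families` and the gene identifiers are hypothetical, and the original study may have used a different implementation.

```python
def extended_families(pairs):
    """Group genes into families by transitive closure over pairwise hits.

    pairs: iterable of (gene_a, gene_b) tuples that passed the paralog filter.
    Returns a list of families (sorted lists of gene ids), largest first.
    """
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving keeps trees shallow
            gene = parent[gene]
        return gene

    for gene_a, gene_b in pairs:
        root_a, root_b = find(gene_a), find(gene_b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two families

    families = {}
    for gene in list(parent):
        families.setdefault(find(gene), set()).add(gene)
    return sorted((sorted(members) for members in families.values()),
                  key=lambda members: (-len(members), members))


# g1 and g3 share no direct hit but fall in the same family through g2.
example = extended_families([("g1", "g2"), ("g2", "g3"), ("g4", "g5")])
```

Because membership is transitive, a single spurious hit can merge two unrelated families, which is one reason a coverage cut-off such as the 60% length criterion matters.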
Additional data file 1 is a PDF file of a figure showing the number of paralogs and the percentage of paralogous genes in the different-sized gene families in Pirellula sp. compared to other large genomes. Additional data file 2 is a PDF file of a figure showing gene family sizes in intracellular genomes that have undergone reductive evolution compared to related free-living organisms. Additional data file 3 contains legends to the figures in Additional data files 1 and 2. Additional data file 4 is a zip file containing the data from which the figures in the manuscript were made. The files are ordered following the figures as they appear in the text, and a readme text file explains the content of each file.
A.M. is the recipient of a 'Ramón y Cajal' research contract from the Spanish Ministry of Science and Technology (MCyT). Support from European Commission Project GEMINI (QLK3-CT-2002-02056) and MCYT project PM1999-0078 is also acknowledged. We thank Stuart Ingham for help with the graphics.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.