Complexity, connectivity, and duplicability as barriers to lateral gene transfer

Laterally transferred genes are shown to be less involved in protein-protein interactions, and essential genes that exhibit low duplicability and high connectivity do exhibit mostly vertical descent.


Background
Lateral gene transfer (LGT) is a major force in microbial evolution, driving bacterial genetic innovation and speciation [1,2]. The common intuitive notion of a lateral transfer event is an acquisition of a locus or allele with a new and potentially useful function. Indeed, it has been claimed that laterally acquired genes may only be fixed in a population if they are under strong positive selection [3]. The scarcity of transfer of genes involved in informational processes ('informational genes') such as transcription and translation was, therefore, attributed to lack of positive selection due to the inability of newly acquired proteins to interact with their pre-existing native counterparts [4]. According to this concept, designated 'the complexity hypothesis' [4], the chances of a gene to be beneficial to a new host are greatly influenced by the number of its interactions with its new neighbors -implying a direct link between complexity and LGT (Figure 1a, solid arrow).
An alternative explanation for the paucity of transfer of informational genes may be negative (purifying) selection operating against gene acquisition. For many major cellular functions, in particular essential ones, there is likely to be a homologous ancestral gene already present when a foreign ortholog is acquired. In these cases, the laterally acquired gene will have to coexist alongside its native homolog before orthologous replacement (also called xenologous gene displacement) can occur. If the foreign gene is expressed, the chances that this coexistence will be deleterious to the host is especially high when the gene's encoded protein is involved in protein-protein interactions, and even more so when it is a subunit of a protein complex. This is because the increase in concentration of one component (that is, the effect of gene dosage resulting from a single gene duplication) can either inhibit complex assembly, or form undesirable toxic interactions, as described in 'the balance hypothesis' [5]. The authors showed that genes encoding protein complex subunits have lower duplicability, that is, are less likely to have multiple paralogs in a genome. We propose that a horizontally acquired homolog can exert an even more profound negative influence, compared with a duplication product, even at low expression levels. This is because it could be similar enough to interact with a native protein and yet sufficiently different so that an aberrant interaction is formed, destabilizing a native complex in what is often called a 'dominant negative' effect. This situation will result in an indirect effect of complexity on LGT mediated through gene duplicability (Figure 1a, dotted arrows).
Here we perform several analyses aimed at testing the compatibility of the complexity hypothesis and the balance hypothesis with existing protein interaction data, using Escherichia coli as a model for prokaryotic evolution.

Protein complexity and duplicability in E. coli
Previous work in the yeast Saccharomyces cerevisiae demonstrated that protein complexity (number of subunits in a protein) and gene duplicability are inversely correlated, which was attributed to the effects of gene dosage [5]. Accordingly, members of large gene families (having at least three paralogs) were shown to be significantly under-represented in yeast protein complexes when compared to singletons [5]. We tested whether a similar trend exists in the model prokaryote E. coli, which is somewhat less complex than yeast on the organismal level. All available subunit data for E. coli were retrieved from SwissProt [6], and family membership for each gene was obtained from the EMU server (see Materials and methods). The comparison showed that, in E. coli, gene families with three or more paralogs are in fact under-represented in monomers (20.44%) and not, as expected, in hetero-oligomeric protein complexes, where their fraction (29.98%) is very similar to that of the whole genome (31.24%). Furthermore, the data indicated that unlike yeast [7], a group composed of characterized monomers and homooligomers had a similar fraction of singletons (Q) to that of the whole E. coli proteome, and monomers of E. coli even have a higher Q value than the genomic fraction (Table 1). Notably, whereas in Saccharomyces the Q value for heterooligomeric proteins of medium size and above (having at least three subunits of more than one type of polypeptide) was very high, the corresponding Q value in E. coli was not higher than that observed for monomers in the latter organism. Thus, current E. coli protein data do not support a link between gene duplicability and membership in protein complexes.
The lack of correlation between protein complexity and duplicability in E. coli does not support the balance hypothesis for this organism, which may hint at different selective forces on Alternative scenarios for the effect of complexity and duplicability on lateral gene transfer complexity in prokaryotic genomes to those operating in eukaryotes. This finding is in agreement with recent findings by Ochman and colleagues [8], demonstrating differences in protein interaction network evolution between eukaryotes and prokaryotes.
Protein-protein interactions outside multimeric proteins also contribute greatly to the complexity of an organism. However, unlike subunits in a protein complex, external interactions usually do not require precise stoichiometry and may, consequently, be less sensitive to dosage effects. Therefore, one would expect these interactions not to be linked to duplicability. Indeed, there was no significant difference in Q for proteins having a single interaction versus those with multiple interactions in E. coli (p = 0.21). Recently, a high-throughput analysis of protein-protein interactions in E. coli has been carried out using a His-tagged clone library pull-down [9] that is more sensitive than the TAP-or SPA-tagged bait protein approach used previously [10] and thus provides interaction data for most E. coli proteins. High-throughput methods cannot distinguish between interactions of subunits within a stable protein complex and those that are external to the complex. Therefore, the data include the cumulative contribution of both types of interaction to complexity. This category of combined interactions will henceforth be referred to as connectivity. In agreement with the SwissProt data, the Q value of proteins with more than one interaction in the pull-down study (Table 1) was not significantly higher than that of proteins with a single interaction (p = 0.71). Nevertheless, an analysis of the number of interacting partners of these proteins showed that, in E. coli, singletons have significantly more partners (a higher connectivity) than do proteins with paralogs (4.474 versus 4.095, p = 0.01), so some correlation between connectivity and duplicability does exist.

Characterized protein complexes are not resistant to transfer
Correlating complexity or duplicability with LGT on a genomic scale is complicated by the fact that different LGT detection methods often identify different subsets of genes [11,12]. Therefore, for our analysis we relied on two different datasets: genes identified as being acquired by LGT by Bayesian phylogenetic analysis [12,13] and genes identified as transferred due to atypical nucleotide composition [11,14]. This last group of genes is thought to represent relatively recent transfer events [11,12]. In order to address general 'transferability' of genes in microbial evolution, rather than just specific origin of genes in E. coli, we used an established global estimator, the 'phylogenetically discordant sequence' (PDS) metric. This parameter measures the extent to which a protein's phylogenetic signal matches most other proteins' phylogenetic signals in a genome by examining its similarity to its reciprocal best matches in other genomes [15,16]. Values range from 0 to 1, where a totally concordant sequence has a score of 1, and a highly discordant protein has a score of 0. It is important to note that a gene that is vertically derived in E. coli but has been involved in many LGT events in other taxa could have a low PDS score, due to its irregular pattern, which is appropriate for global assessment of a gene's propensity for LGT. Also, transfers within closely related organisms will generally affect PDS to a lesser extent than transfers between remote taxa [15].
Having a foreign variant of a protein could have a destabilizing effect on a complex, even a homo-oligomeric one, resulting in a dominant negative phenotype. If selection against LGT is mediated by such a dominant negative mechanism, one would expect to observe a difference between the LGT propensity of monomers, which should be free of such  Table 2) with SwissProt subunit data (see Materials and methods) we observed no significant difference in PDS between monomers and other homo-oligomers (p = 0.056), indicating no support for such a role in bacterial protein evolution. When a protein complex involves more than one type of polypeptide, the chance for a negative influence of a foreign homolog may increase due to the effect of gene dosage -the requirement of a precise molar ratio between subunits in order to guarantee a functional complex [5]. However, comparing PDS values between hetero-oligomeric complexes and the group containing monomers/homo-oligomers showed only an insignificant increase in the average PDS value (p = 0.16). It appears, therefore, that protein complexes are seldom a major barrier to transfer, unless essential complexes are involved (see below). A possible explanation for this observation is that genes encoding complex components tend to be located adjacent to each other in prokaryotes. Thus, a lateral transfer event can, in principle, transfer the whole complex as a unit [17], be it on a plasmid or genomic island. A good example of an extremely large and complex structure that has been frequently transferred is the virulence-associated bacterial type III secretion systems [18]. Similarly, a global survey of LGT across microbial genomes has shown microbial surface structures such as pili, which are often multimeric, to be frequent products of transfer [19]. Pili-encoding genes are nearly always found in operons, so the combination of a function that improves fitness in a niche and location appears to be more potent than the negative effects of complexity, if any, in determining transferability.

Frequently transferred genes have fewer external interaction partners
Although the original complexity hypothesis was mostly focused on protein complexes, the authors nevertheless left room for other interactions. Indeed, Jain and colleagues [4] stated that "... the probability of a successful horizontal trans-fer will be strongly affected by the number of interactions that a protein must make with its neighbors." Thus, we suggest that the scope of the complexity hypothesis should be expanded to include all connectivity, rather than complexity. Characterized E. coli proteins that were found to be involved in a single interaction were, therefore, compared to proteins with multiple interactions (Table 2). Notably, proteins with multiple interactions have a higher average PDS value (0.845) than proteins with a single interaction (0.766), and the difference is significant (p = 0.00005). A similar trend was observed for the pull-down data, but was not significant (PDS scores of 0.732 and 0.711, respectively, p = 0.086). Thus, genes that are more frequently transferred in evolution tend to have lower connectivity, in agreement with our broader definition of the complexity hypothesis. The fractions of laterally transferred interacting genes (one interaction or more) in E. coli, identified by either composition or Bayesian phylogeny, were not significantly different from the average for E. coli. However, high-throughput interaction data (that includes interactions within a protein complex, see above) indicate that although transferred genes identified by Bayesian phylogeny have a higher connectivity average that is not significant (4.495 versus the E. coli average of 4.305, p = 0.311), the acquired genes with atypical composition, assumed to be more recent arrivals in the genome, have a significantly lower number of interactions (3.9922, p = 0.049). It therefore appears that genes that are more recent arrivals in a genome have lower connectivity than the rest of the genes and probably have not integrated fully into the genome's interaction network. Thus, the broader sense of the complexity hypothesis is again in agreement with the data. The evolutionary mechanism behind our observation remains unclear -is it that genes that have to interact with multiple partners are seldom retained or do transferred genes just gradually adapt to the new network? We feel our findings regarding recently transferred genes support the latter explanation, but the former alternative cannot altogether be rejected.

Conclusion
Our findings shed new light on current paradigms on transferability and duplicability of genes in prokaryotes. We propose that duplicability and, to a lesser degree, connectivity, can directly affect the fixation of laterally transferred genes in prokaryotic genomes (Figure 1b). We expand the complexity hypothesis to include general connectivity and show that, in its strictest sense, it applies mostly to essential complexes.
Many phylogenetic studies base their analysis on single copy genes to avoid problems in discerning orthology. Based on the findings presented here it is likely that this practice in fact filters out many laterally transferred genes. While this may be desirable when reconstructing organismal phylogenies ('trees of life') based on a non-transferred core [23], it is highly inappropriate when assessing the impact of LGT on different genomes.

Identification of singletons and gene families
A dataset of all 4,302 E. coli K12 MG 1655 proteins was retrieved from the EMU web service [24]. We identified 480 gene families using the 'Genome query for gene families' query with a BLAST threshold of e -10 ; 1,920 genes were obtained that belonged to gene families. Subtraction of these genes from the dataset of all protein-coding genes resulted in 2,382 singletons.

Identification of subunits of protein complexes and external interactions
Protein complex and protein interaction data were automatically retrieved from SwissProt/TrEMBL and manually sorted. Complex subunit information (SwissProt field 'Subunit') was obtained for 1,003 genes, out of which 137 genes could unequivocally be classified as monomers, 459 as homo-oligomers and 407 as hetero-oligomers. In addition, all homo-oligomers that formed a homotrimer or more complex structure were grouped as mid-to-large complexes. External protein interaction data (SwissProt field 'Interaction') were similarly retrieved from SwissProt. High-throughput interaction data were retrieved from a recent E. coli pull-down study [9].

Identification of essential genes
We obtained 232 protein-coding essential genes for E. coli K-12 from the PEC (profiling of the E. coli chromosome) website [25].

Phylogenetically discordant sequences determination
The PDS metric for E. coli proteins [15] was determined using the 'Sorted lists of ORF characteristics' query of the EMU web service. PDS values are based on the data from 352 microbial genomes available in EMU in September 2006 (Additional data file 1).

Statistical analysis
The SPSS statistics package version 12 (SPSS Inc., Chicago, IL, USA) was used. Significance scores for comparisons of fractions of singletons were determined using the chi-square test. Significance for comparisons of PDS values and interaction partner numbers were determined using the Mann-Whitney-Wilcoxon U test.