Novel genes exhibit distinct patterns of function acquisition and network integration

Background
Genes are created by a variety of evolutionary processes, some of which generate duplicate copies of an entire gene, while others rearrange pre-existing genetic elements or co-opt previously non-coding sequence to create genes with 'novel' sequences. These novel genes are thought to contribute to distinct phenotypes that distinguish organisms. The creation, evolution, and function of duplicated genes are well-studied; however, the genesis and early evolution of novel genes are not well-characterized. We developed a computational approach to investigate these issues by integrating genome-wide comparative phylogenetic analysis with functional and interaction data derived from small-scale and high-throughput experiments.
Results
We examine the function and evolution of new genes in the yeast Saccharomyces cerevisiae. We observed significant differences in the functional attributes and interactions of genes created at different times and by different mechanisms. Novel genes are initially less integrated into cellular networks than duplicate genes, but they appear to gain functions and interactions more quickly than duplicates. Recently created duplicated genes show evidence of adapting existing functions to environmental changes, while young novel genes do not exhibit enrichment for any particular functions. Finally, we found a significant preference for genes to interact with other genes of similar age and origin.
Conclusions
Our results suggest a strong relationship between how and when genes are created and the roles they play in the cell. Overall, genes tend to become more integrated into the functional networks of the cell with time, but the dynamics of this process differ significantly between duplicate and novel genes.


S1.1 Definition of Gene Origin
Our classification of S. cerevisiae genes into mechanism-of-origin groups relies on gene families and evolutionary histories reconstructed across related species. Fully reconstructing these complex sequences of evolutionary events is a very difficult problem; gene loss, fusion, fission, and rearrangement can obscure the origins of a gene. Since this is an area of active research and a variety of methods have been developed for inferring gene families and evolutionary histories, we tested the sensitivity of our conclusions to the use of several computational methods for these tasks.
We investigated two different strategies for gene origin classification. The first approach considers the existence of paralogs of a gene in the same species based on a particular gene family definition. All genes with paralogs in the species are assigned to the duplicate category and all other genes are assigned to the novel category. The results using families defined by a Jaccard clustering algorithm are presented in the main text, and results on two additional family definitions (from OrthoMCL and MultiParanoid) are given below. The second approach to origin classification uses predicted evolutionary histories for each family. We take the histories predicted for each gene across 23 fungal species by the Synergy algorithm [42]. Any gene with a duplication on the path from it to the root of the tree or with a homologous orthogroup is assigned to duplicate; all other genes are assigned to novel. We demonstrate here that the conclusions presented in the main text hold across all these methodological variations with a few minor exceptions.
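The family-based rule can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function name, the input format (a mapping from S. cerevisiae gene ID to family ID, already restricted to that species), and the toy gene names are all assumptions.

```python
from collections import Counter

def classify_origin(gene_to_family):
    """Label a gene 'duplicate' if its family has another member in the
    same species (a paralog), otherwise 'novel'.

    gene_to_family maps gene ID -> family ID, or None for a gene that
    belongs to no family. Only genes of one species should be passed in.
    """
    family_sizes = Counter(f for f in gene_to_family.values() if f is not None)
    return {
        gene: "duplicate" if fam is not None and family_sizes[fam] > 1 else "novel"
        for gene, fam in gene_to_family.items()
    }

# Hypothetical toy input; gene and family names are illustrative only.
toy_families = {"geneA": "fam1", "geneB": "fam1", "geneC": "fam2", "geneD": None}
toy_origin = classify_origin(toy_families)
print(toy_origin)
```

Because the rule depends only on family membership, swapping in a different family definition changes the input mapping and nothing else.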

S1.1.1 Gene Evolutionary History
Synergy
Evolutionary histories and gene trees have been generated for all genes in S. cerevisiae by the Synergy algorithm [25,42]. Synergy builds "orthogroups" of genes derived from a common ancestor by combining analysis of sequence similarity and gene synteny. We downloaded the predicted orthogroups and gene trees from version 1.1 of the Fungal Orthogroups web site on October 19, 2009. Synergy's predictions were not in complete agreement with those of the family-based method (Figure S1); 76% (4358 of 5770) of the assignments agree. It should also be noted that the Synergy algorithm considered several genomes in addition to those used in the ancestral reconstruction of Gordon et al. [39]. However, these differences did not dramatically affect our conclusions. Table S1 demonstrates that young novel genes are still dramatically shorter and less functionally annotated than the other groups. In the Synergy-based analysis, young duplicates are still significantly less essential, less annotated, and less integrated into interaction networks than old duplicates (Table S1). Figure S2 shows that the significant preference for proteins to interact with other proteins of the same age and origin is maintained in this data set.
We also performed the functional and network analysis on the 4358 genes for which the Synergy and Jaccard clustering age/origin assignments agreed. Table S2 and Figure S3 show that these results resemble those observed using either approach alone and support our main conclusions that: young genes are less functionally integrated into the cell than old genes; young novel genes are particularly short and peripheral in function and interactions; and genes in every group are more likely to interact with other genes in the same group than expected.
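The agreement figure and the restriction to concordant genes can be computed as sketched below. The function name, the label format (an age/origin tuple per gene), and the toy data are assumptions made for illustration.

```python
def agreement(labels_a, labels_b):
    """Compare two age/origin assignments over the genes they share.

    Returns (list of agreeing genes, number of genes compared,
    fraction agreeing).
    """
    shared = sorted(set(labels_a) & set(labels_b))
    agree = [g for g in shared if labels_a[g] == labels_b[g]]
    frac = len(agree) / len(shared) if shared else float("nan")
    return agree, len(shared), frac

# Hypothetical toy labels: (age, origin) per gene.
jaccard_labels = {
    "g1": ("pre-WGD", "duplicate"),
    "g2": ("pre-WGD", "novel"),
    "g3": ("post-WGD", "novel"),
    "g4": ("post-WGD", "duplicate"),
}
synergy_labels = {
    "g1": ("pre-WGD", "duplicate"),
    "g2": ("pre-WGD", "duplicate"),  # the two methods disagree on origin here
    "g3": ("post-WGD", "novel"),
    "g4": ("post-WGD", "duplicate"),
    "g5": ("pre-WGD", "novel"),      # absent from the other set, so ignored
}
agree_genes, n_compared, frac_agree = agreement(jaccard_labels, synergy_labels)
print(agree_genes, n_compared, frac_agree)

# The counts reported in the text correspond to the stated percentage.
reported_percent = round(100 * 4358 / 5770)
print(reported_percent)
```

The list of agreeing genes is what a downstream analysis restricted to concordant assignments would take as input.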

S1.1.2 Gene Family Definition
The Princeton Protein Orthology Database (PPOD) [40] provides predictions of homologous families from three different algorithms: OrthoMCL [87], MultiParanoid [88], and a Jaccard clustering-based approach. The Jaccard clustering approach was used in the main text because we found it assigned the highest percentage of known WGD duplicates into the same families (85% vs. 40-50%). We now give the results for MultiParanoid and OrthoMCL, which are intended to predict smaller orthologous groups across species. In general, the results are similar; however, there are a few differences as a result of the different families predicted by these methods. From our analysis of WGD duplicates, we expect these two other methods to mischaracterize duplicate genes as novel more frequently than the Jaccard clustering approach does.
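The criterion used to choose among family definitions, the share of known WGD duplicate pairs placed in the same family, can be sketched as follows. The function name and the toy pairs are hypothetical; a real run would use a curated list of WGD pairs and each method's family assignments.

```python
def fraction_co_clustered(pairs, gene_to_family):
    """Fraction of gene pairs whose two members share a family.

    A pair counts only if both genes are assigned to the same,
    non-missing family ID.
    """
    if not pairs:
        raise ValueError("at least one gene pair is required")
    same = 0
    for a, b in pairs:
        fam_a, fam_b = gene_to_family.get(a), gene_to_family.get(b)
        if fam_a is not None and fam_a == fam_b:
            same += 1
    return same / len(pairs)

# Hypothetical WGD pairs and family assignments, for illustration only.
toy_wgd_pairs = [("a1", "a2"), ("b1", "b2"), ("c1", "c2"), ("d1", "d2")]
toy_assignment = {
    "a1": "f1", "a2": "f1",
    "b1": "f2", "b2": "f2",
    "c1": "f3", "c2": "f3",
    "d1": "f4", "d2": "f5",  # a diverged pair split into separate families
}
toy_fraction = fraction_co_clustered(toy_wgd_pairs, toy_assignment)
print(toy_fraction)
```

A method that splits diverged duplicates into separate families scores lower on this measure, and the split genes would then be labeled novel by the paralog rule.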

OrthoMCL
The main conclusions of our analysis are all supported when gene families from OrthoMCL are used in the age/origin assignment. One notable difference is that the average length of young novel genes is noticeably longer than when the Jaccard clustering approach is used. We suspect that this is the result of a number of diverged duplicate genes not being recognized as duplicates and thus being included in the novel group (Table S3). However, the length of the young novel genes is still significantly less than that of the older novel genes, and all other functional and interaction patterns are maintained. The preference of genes to interact with other genes of the same origin and age is also found in this classification (Figure S4), though the preference observed among young novel genes is not significant (p=0.082).

MultiParanoid
Similarly, the main conclusions of our analysis are supported when gene families from MultiParanoid are used in the age/origin assignment (Table S4, Figure S5). However, in this case, the young duplicate genes are nearly as long as the older duplicates, but as before they have significantly fewer interactions and are far less essential. As with OrthoMCL, this may be the result of diverged duplicates not being recognized and thus being assigned to novel groups.
The overall similarity of the results on these independent data sets and classifications strongly supports our conclusions.

Effect of Subtelomeric Genes
Subtelomeric regions are very dynamic; they experience a large number of rearrangements and duplications. As a result, the ancestral reconstruction of Gordon et al. [39] did not include these regions. We wanted to consider these genes in our analysis, because many lineage-specific genes appear to be born and amplified in these regions. Since we could not perform the pre-WGD ancestor-based age classification on subtelomeric genes, we aged them using alignments of orthologs generated by the SGD (see Methods in main text).
To demonstrate that the patterns we observe among young novel and young duplicate genes are not specific to those found in subtelomeric regions, we repeated the analysis without these genes. The number of young genes is greatly reduced, but where there is sufficient data, the same patterns are apparent (Table S5, Figure S6).
The enrichment for Gene Ontology functional terms related to environmental response was maintained when subtelomeric young duplicate genes were excluded from the analysis. However, the enrichment for carbohydrate processing genes was lost. This argues that the recent innovation in these functions has been focused in subtelomeric regions. The full list of enriched terms when excluding subtelomeric genes is given in Table S6.

Effect of Essential Genes
Essential proteins have been found to participate in more interactions than non-essential proteins [58,59]. Since older genes are more likely to be essential than younger genes, we repeated our analysis excluding essential genes to test if these old genes carrying out essential functions are responsible for the increase in interactions observed for older genes in the network. Table S7 and Figure S7 demonstrate that the same relationship between the age of a protein and its cellular context was found without essential genes as when essential genes were considered.
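One way to run this control is to recompute the average number of interaction partners per group after removing a set of genes. The sketch below assumes that excluded genes are dropped from the network together with their interactions; the text does not specify whether interactions with essential partners were kept, so that choice, along with the function name and toy data, is an assumption.

```python
from collections import defaultdict

def mean_degree_by_group(edges, group_of, exclude=frozenset()):
    """Average number of distinct interaction partners per gene in each
    group, after removing the excluded genes and every edge touching them.
    """
    partners = {g: set() for g in group_of if g not in exclude}
    for a, b in edges:
        if a in partners and b in partners and a != b:
            partners[a].add(b)
            partners[b].add(a)
    degrees = defaultdict(list)
    for gene, neighbours in partners.items():
        degrees[group_of[gene]].append(len(neighbours))
    return {grp: sum(d) / len(d) for grp, d in degrees.items()}

# Hypothetical toy network; "o1" plays the role of an essential hub.
toy_groups = {"o1": "old", "o2": "old", "o3": "old", "y1": "young", "y2": "young"}
toy_edges = [("o1", "o2"), ("o1", "o3"), ("o2", "o3"), ("o1", "y1")]
with_essential = mean_degree_by_group(toy_edges, toy_groups)
without_essential = mean_degree_by_group(toy_edges, toy_groups, exclude={"o1"})
print(with_essential)
print(without_essential)
```

If the old group keeps a higher mean degree after the exclusion, as in this toy case, the difference is not driven by essential hubs alone.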

Inference of Ancestral Duplicate Copy
Selecting which gene among a set of duplicates is the ancestral copy is often very difficult, particularly in the case of tandem duplicates [38]. Further complicating this task, there is no guarantee that the initial member of the family is still present in the genome. In our analysis, we dealt with this situation by assigning all genes that had experienced a duplication (that is, the members of each homologous family) to the duplicate class.
To explore the effect of this choice on our results, we tested another strategy in which we selected the oldest gene from each homologous family to serve as the progenitor of the family. The oldest gene was defined as the gene with the most distant homolog in the YGOB (or SGD alignments for subtelomeric genes). If there was more than one oldest gene, a progenitor was selected randomly among them. This gene was assigned to the novel class. Table S8 and Figure S8 demonstrate that our conclusions hold on this adapted classification.
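The progenitor strategy can be sketched as below. The numeric `age_rank` (larger meaning a more distant homolog, hence an older gene) is a hypothetical stand-in for the YGOB or SGD alignment evidence, and the sketch assumes that the remaining family members stay in the duplicate class. The seeded random generator only makes the tie-break reproducible.

```python
import random

def assign_with_progenitor(families, age_rank, rng=None):
    """Assign one progenitor per family to 'novel' and the remaining
    members to 'duplicate'.

    families: family ID -> list of member genes.
    age_rank: gene -> number, larger meaning older (an assumed encoding).
    Ties for oldest are broken at random, as described in the text.
    """
    rng = rng or random.Random(0)
    origin = {}
    for members in families.values():
        oldest = max(age_rank[g] for g in members)
        candidates = sorted(g for g in members if age_rank[g] == oldest)
        progenitor = rng.choice(candidates)
        for gene in members:
            origin[gene] = "novel" if gene == progenitor else "duplicate"
    return origin

# Hypothetical toy families and age ranks, for illustration only.
toy_fams = {"fam1": ["a", "b", "c"], "fam2": ["d"], "fam3": ["e", "f"]}
toy_ranks = {"a": 3, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}
toy_assignment_prog = assign_with_progenitor(toy_fams, toy_ranks)
print(toy_assignment_prog)
```

A single-member family has only one candidate, so its gene is labeled novel, which matches its label under the original paralog rule.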

S1.2 Protein-Protein Interaction Networks
The results presented in the main text reflect the integration of proteins from each age/origin class into a physical protein-protein interaction network consisting of a combination of interaction data from small-scale experiments and high-throughput studies collected in the Database of Interacting Proteins (DIP) [56]. The next several sections show that these conclusions hold on different interaction datasets.

BioGRID [82] is a repository for protein interaction data. Kim and Marcotte [53], following Batada et al. [89], used specialized filters and confidence measures to build a network combining high-throughput and literature-curated interactions from BioGRID. This network contains fewer interactions for young proteins than DIP, but our conclusions hold on this interaction network as well. Table S9 shows that young genes are less integrated into the network. Figure S9 shows that the preference for proteins to interact with other proteins of the same age and origin is also maintained. However, no interactions between young novel proteins were observed in this filtered network.
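The statistical test behind the interaction-preference figures is described in the main text and is not reproduced here. As an illustration of the general idea only, the sketch below uses a simple label-permutation test: group labels are shuffled across proteins and the number of within-group interactions is recounted. This is an assumed stand-in, not the authors' procedure, and it does not control for differences in degree between groups.

```python
import random

def within_group_pvalue(edges, group_of, group, n_perm=1000, seed=0):
    """One-sided permutation p-value for an excess of interactions with
    both endpoints in `group`.

    Every gene appearing in `edges` must have a label in `group_of`.
    """
    def count(labels):
        return sum(1 for a, b in edges if labels[a] == group and labels[b] == group)

    genes = sorted(group_of)
    labels = [group_of[g] for g in genes]
    observed = count(group_of)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # reassign group labels at random
        if count(dict(zip(genes, labels))) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical toy network: the four "young" proteins form a clique.
toy_labels = {f"y{i}": "young" for i in range(1, 5)}
toy_labels.update({f"o{i}": "old" for i in range(1, 5)})
toy_ppi = [("y1", "y2"), ("y1", "y3"), ("y1", "y4"),
           ("y2", "y3"), ("y2", "y4"), ("y3", "y4"),
           ("o1", "o2"), ("o3", "o4")]
obs_young, p_young = within_group_pvalue(toy_ppi, toy_labels, "young")
print(obs_young, p_young)
```

When a group has no internal interactions at all, as reported for young novel proteins in the filtered network, the observed count is zero and a test of this kind cannot detect a preference.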

High-throughput Only
The presence of interactions inferred from small-scale studies could introduce a bias toward interactions involving well-studied proteins into the network. To test the impact of this potential bias, we analyzed the high-throughput-only subnetwork of the Kim and Marcotte [53] network, which is easily divided into a literature-curated interaction set and a set determined by high-throughput experimental methods. We obtained similar results when only interactions determined by high-throughput studies were considered. Most notably, young genes are still less integrated into the network than older genes, and young novel genes are the most peripheral (Table S10). The greater network integration of older proteins does not appear to be an artifact of experimental bias. In this reduced set of interactions, there were no interactions within the young protein groups (Figure S10). Overall, we did not observe a significant difference in the percentage of interactions involving young or novel proteins between the high-throughput and literature-curated sets.

S2.1 GO Functional Enrichment of Young Genes
In the main text we summarized the results of GO annotation enrichment analysis among the groups of young genes. No significant enrichment was found among the young novel genes, but many terms related to environmental response and carbohydrate processing were enriched among the young duplicate genes. The complete lists of enriched terms from each hierarchy are given in Tables S12-S14. See Section S1.1.3 for a discussion of the impact of subtelomeric genes on functional enrichment.
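This section does not state which enrichment test was used. A common choice for GO term enrichment is the one-sided hypergeometric test, sketched here as an assumption rather than as the authors' method. The function name and the toy counts are illustrative, and no multiple-testing correction is applied.

```python
from math import comb

def hypergeom_enrichment_p(n_total, n_annotated, n_group, n_overlap):
    """Probability of seeing at least n_overlap annotated genes when
    n_group genes are drawn without replacement from n_total genes,
    of which n_annotated carry the term.
    """
    upper = min(n_annotated, n_group)
    tail = sum(
        comb(n_annotated, k) * comb(n_total - n_annotated, n_group - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_total, n_group)

# Toy counts: 10 genes, 3 carry the term, and a group of 3 contains all of them.
toy_p = hypergeom_enrichment_p(n_total=10, n_annotated=3, n_group=3, n_overlap=3)
print(toy_p)
```

In practice the p-values for all tested terms would need correction for multiple comparisons before any term is called enriched.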

S2.2 A More Specific Classification of Gene Age
We also considered a more specific temporal classification of the pre-WGD genes into two age groups: 1) those created prior to the divergence of S. cerevisiae and Schizosaccharomyces pombe (pre-WGD-ancestral) and 2) those created after this divergence but before the WGD (pre-WGD-post-pombe). All genes from the pre-WGD age group described in the main text were assigned to either pre-WGD-ancestral or pre-WGD-post-pombe based on their presence or absence in a homologous family in S. pombe [83].
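The split of the pre-WGD group reduces to a lookup on S. pombe homology, as sketched below. The function name, the boolean homology map, and the toy genes are assumptions; in the analysis described here the homology call comes from membership in a homologous family that includes an S. pombe gene.

```python
def refine_pre_wgd(age_of, has_pombe_homolog):
    """Split 'pre-WGD' genes into 'pre-WGD-ancestral' (homolog found in
    S. pombe) and 'pre-WGD-post-pombe' (no homolog found); all other age
    labels are left unchanged.
    """
    refined = {}
    for gene, age in age_of.items():
        if age != "pre-WGD":
            refined[gene] = age
        elif has_pombe_homolog.get(gene, False):
            refined[gene] = "pre-WGD-ancestral"
        else:
            refined[gene] = "pre-WGD-post-pombe"
    return refined

# Hypothetical toy input, for illustration only.
toy_ages = {"g1": "pre-WGD", "g2": "pre-WGD", "g3": "WGD", "g4": "post-WGD"}
toy_pombe = {"g1": True, "g2": False}
toy_refined = refine_pre_wgd(toy_ages, toy_pombe)
print(toy_refined)
```

A gene with no entry in the homology map is treated as lacking an S. pombe homolog, so missing data pushes genes toward the younger class.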
Our main conclusions are maintained on this more specific age grouping. The functional properties of genes in the pre-WGD-post-pombe group fall in between the post-WGD and pre-WGD-ancestral groups (Table S11, Figure S11). This additional temporal data point adds strong support to our conclusion that on average genes gain functions and interactions over time. Similarly, the tendency of genes to preferentially interact with other genes of the same age and mechanism of origin is also maintained under this finer classification. This preference is significant for all groups except the pre-WGD-post-pombe/duplicate proteins, which also interact with one another more often than expected by chance, although not significantly so (p = 0.13).

Figure S1: Overlap of Synergy-based and Jaccard clustering-based age/origin classification. The mechanism of origin of each gene in S. cerevisiae was predicted using the Jaccard clustering family approach and the Synergy gene tree approach. The age of each gene was predicted as described in the main text. WGD proteins are not listed because they did not differ between the classifications. [Panel labels: pre-WGD/duplicate, pre-WGD/novel, post-WGD/novel, post-WGD/duplicate]

Figure S2: Significance of interaction preferences when protein origin is predicted from Synergy orthogroups. All groups and statistics are as in Figure 5 of the main text. As we observed with the age groups used in the main text, the red trend across the diagonal reflects the significant preference for proteins to interact within their age/origin group. The only significant enrichment for interactions between proteins of different age or origin is among young (post-WGD) proteins.

Figure S3: Significance of interaction preferences when only proteins with agreeing age/origin assignments from Synergy and Jaccard clustering are considered. All groups and statistics are as in Figure 5 of the main text. As we observed with the age groups used in the main text, the red trend across the diagonal reflects the significant preference for proteins to interact within their age/origin group.

Figure S4: Significance of interaction preferences by protein age and origin with gene families defined by OrthoMCL. All groups and statistics are as in Figure 5 of the main text.

Figure S5: Significance of interaction preferences by protein age and origin with gene families defined by MultiParanoid. All groups and statistics are as in Figure 5 of the main text.

Figure S6: Significance of interaction preferences by protein age and origin when subtelomeric genes are not considered. All groups and statistics are as in Figure 5 of the main text. No interactions were observed between non-subtelomeric post-WGD/novel and post-WGD/duplicate genes.

Figure S7: Significance of interaction preferences by protein age and origin when essential genes are not considered. All groups and statistics are as in Figure 5 of the main text.

Figure S8: Significance of interaction preferences by protein age and origin when selecting a progenitor for each gene family. All other groups and statistics are as in Figure 5 of the main text. As we observed with the age groups used in the main text, the red trend across the diagonal reflects the significant preference for proteins to interact within their age/origin group. The only significant enrichment for interactions between proteins of different age or origin is among young (post-WGD) proteins.

Figure S9: Significance of interaction preferences by protein age and origin on the filtered BioGRID network. All groups and statistics are as in Figure 5 of the main text. As we observed with the age groups used in the main text, the red trend across the diagonal reflects the significant preference for proteins to interact within their age/origin group. No interactions were observed within the post-WGD/novel group.

Table S5: Average functional and interaction properties for age/origin gene groups when subtelomeric genes are not considered.

Table S11: Average functional and interaction properties for age/origin gene groups with an additional age category. In this table, pre-WGD-post-pombe genes are those gained prior to the WGD, but after the divergence of S. cerevisiae and S. pombe. pre-WGD-ancestral genes are those gained prior to this divergence. The WGD/duplicate genes are not necessarily expected to follow the temporal patterns of other duplicate genes, as the pressures following the WGD were likely very different from those following a small-scale duplication.