What makes species unique? The contribution of proteins with obscure features

An analysis of proteins with obscure features in ten eukaryotic genomes revealed that the majority are species-specific.

revealed a high degree of similarity among different species, it has been accepted widely that the uniqueness of a particular species was driven by changes in regulatory genes or elements [1][2][3][4][5][6], as opposed to the divergence of established coding sequences or the creation of new genes. This has led to a widespread perspective that just a few model organisms can provide the experimental foundation to assign functions to nearly every eukaryotic gene.
Noticeably lacking from the comparative analysis of eukaryotic genomes to date, however, is an analysis of the origins and functions of genes encoding proteins that currently lack defined motifs or domains (proteins with obscure features, POFs) [7,8]. Expression profiling studies in different organisms suggested that POFs play an important role in many different biological processes. Nevertheless, their biological roles and origins remain poorly understood, and elucidating their functions is currently a major goal of biological research in almost all organisms studied [7,8]. In this paper we examine the possibility that genes encoding POFs, which account for approximately one-quarter of all eukaryotic genes, play a role in determining differences among species. By analogy to the expectation that proteins with defined motifs or domains (PDFs) are often conserved among species [4][5][6], one might expect that POFs would show a parallel pattern of phylogenetic conservation. To test this assumption, we performed a comparative analysis of 10 different eukaryotic proteomes, including budding and fission yeast, worm, fruit fly, mosquito, Arabidopsis, rice, mouse, rat, and human. Surprisingly, in contrast to PDFs, we found that POFs include a much larger percentage of proteins that are highly divergent. Our results underscore the importance of delineating the origins and functions of POFs as an underlying cause of species specificity.

Approximately one-quarter of eukaryotic proteins are POFs
Proteins were analyzed from ten different model proteomes and classified as POFs if they lacked an established domain or motif, including domains of unknown function. As shown in Figure 1, between 18% and 38% of all proteins (average 26%) predicted from each genome were classified as POFs.

POFs are more divergent than PDFs
To evaluate the diversity among PDFs and POFs in different proteomes, we compared their sequence relatedness to each other using BLAST (Figure 2; Figures 1S to 3S in Additional data file 1). The percentage of related proteins was plotted as a function of similarity cutoff thresholds that ranged from non-stringent (BLAST E-values greater than 10^-6) to stringent (from 10^-9 to 10^-80 or less). This method of plotting similarity differences permits the visualization of reproducible differences between PDFs and POFs across a wide range of cutoff thresholds. Unless noted otherwise, a BLAST E-value cutoff of 10^-6 was used as the threshold for classifying a sequence as related. Using this similarity threshold, a total of 1,650 protein groups were found to be conserved among all 10 proteomes (Tables 1S and 2S in Additional data file 1). Surprisingly, only 3 of those (<0.2%) were POFs (as represented in S. cerevisiae by gi|6319274|, gi|6320573| and gi|6324048|).
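The cutoff-threshold curves described above can be illustrated with a minimal sketch. It assumes that each protein is summarized by the E-value of its best BLAST hit in the other proteome (None when no hit was reported) and that a protein counts as related when this E-value is at or below the cutoff. The function name, the input layout and the example values are hypothetical and are not taken from the original analysis.

```python
from typing import List, Optional, Sequence


def percent_related(best_evalues: Sequence[Optional[float]],
                    cutoffs: Sequence[float]) -> List[float]:
    """Return, for each cutoff, the percentage of proteins whose best-hit
    E-value is at or below that cutoff (assumed definition of 'related')."""
    total = len(best_evalues)
    if total == 0:
        return [0.0 for _ in cutoffs]
    percentages = []
    for cutoff in cutoffs:
        # Proteins with no reported hit (None) are never counted as related.
        related = sum(1 for e in best_evalues if e is not None and e <= cutoff)
        percentages.append(100.0 * related / total)
    return percentages


# Hypothetical best-hit E-values for a handful of proteins in each class.
CUTOFFS = [1e-6, 1e-9, 1e-20, 1e-40, 1e-80]
pof_hits = [1e-7, 1e-10, None, 1e-3, 1e-50]
pdf_hits = [1e-90, 1e-45, 1e-30, 1e-12, 1e-8]

pof_curve = percent_related(pof_hits, CUTOFFS)
pdf_curve = percent_related(pdf_hits, CUTOFFS)
```

Plotting the two curves against the cutoffs would give one panel of the kind of comparison shown in Figure 2, with the cutoff at which a curve crosses 50% serving as a single summary point.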
Among the 10 proteomes, POFs always showed significantly more divergence than PDFs, as illustrated for S. cerevisiae and S. pombe, or M. musculus and R. norvegicus in Figure 2a. This relatively high divergence within the group of POFs was also true in a parallel analysis in which only proteins with essential functions were compared. In Figure 2c, the essential POFs and PDFs were compared from the proteomes of C. elegans (Ce) [9] and S. cerevisiae (Sc) [10]. These two organisms have been subjected to systematic deletion or RNAi analyses to define essential genes. In this comparison, essential POFs were still 3 to 9 orders of magnitude more dissimilar, as shown by a comparison between the 50% similarity points.
For each proteome we identified the set of unique proteins not found in any of the other nine proteomes analyzed (Tables 3S and 4S in Additional data file 1). We then determined the relative proportion of POFs or PDFs designated as unique (BLAST cutoff of 10^-6). As shown in Figure 3a, the relative percentage of POFs designated as unique was always higher than the percentage of PDFs. On average, we found that 60% of POFs were designated as unique. Average sequence similarity relationship trees constructed for all 10 proteomes, based on POFs or PDFs (Figure 3b), revealed that the divergence of POFs among the 10 different proteomes was consistently greater than that of PDFs, supporting the contention that POFs account for the majority of phylogenetically specific ORFs (Figure 3a).
Figure 3. POFs are more likely to be species specific than PDFs among 10 different proteomes.

Comparative analysis of the human and chimpanzee proteomes
The recent publication of a draft chimpanzee (Pan troglodytes (Pt)) genome [11] provided us with a unique opportunity to compare the similarity of POFs and PDFs encoded in two proteomes that are estimated to have diverged only 5 to 7 million years ago. Compared to the degree of similarity among POFs from human and mouse, the degree of identity among POFs from human and chimpanzee was much higher (Figure 3S in Additional data file 1). To examine what proportion of human-specific proteins are POFs or PDFs we performed a BLAST search (E-value cutoff of 10^-6) of all published sequences against the human proteome. Consistent with POFs representing the majority of species-specific proteins (Figure 3a), POFs accounted for all 27 expressed human-specific proteins (Table 5S in Additional data file 1) not observed in the genomes of any other organism.

Relative contribution of POFs to biological functions
To evaluate POFs for their functional relevance, we first compared the percentage of PDFs and POFs that were represented in expressed sequence tag (EST) collections (Figure 4a). The six animal transcriptomes all showed a greater than 95% representation for transcripts encoding POFs. This high representation was only slightly less than that observed for PDFs. Whereas the two plant transcriptomes showed a somewhat larger difference, the POFs still showed a representation of greater than 75%. Thus, POFs are similar to PDFs in their representation as actively expressed mRNAs.
To compare the relative contributions of PDFs and POFs to protein-protein interaction networks, we examined the percentage representation of each protein class in global interaction data sets available for the Sc [12] and Ce [13] proteomes (Figure 4b). While the POFs showed a slightly reduced representation compared to PDFs, the relative differences were less than 7%.
To compare the relative phenotypic contribution of both protein groups, we examined the percent representation of corresponding mutant phenotypes from the genome-wide functional analyses conducted for S. cerevisiae (Sc) [10] (Figure 4c). With the exception of a potentially noteworthy twofold lower representation of POFs in the essential gene categories, PDFs and POFs showed a similar percent contribution to other phenotypic categories. Similar results were found with C. elegans genome-wide functional analysis (Ce) [9] (data not shown).
Together, the findings presented in Figure 4 suggest that POFs, as a group, are not being misrepresented by an unusually high percentage of proteins being incorrectly predicted from inaccurate gene models. Rather, POFs appear comparable to PDFs in their relative contribution to an organism's repertoire of functional proteins.
To investigate whether any structural characteristics other than established motifs and domains might distinguish POFs from PDFs, the physical properties of the two groups of proteins were examined. Compared to PDFs, POFs as a group are 40% shorter (Figure 5a; ANOVA p < 0.001), have a higher percentage of disordered structure (Figure 5b; ANOVA p < 0.001), and have a higher content of hydrophilic residues (Figure 5c; ANOVA p < 0.001). They also have a higher content of small amino acids (for example, proline and serine), glutamine, and arginine, and a lower content of aliphatic (for example, isoleucine and valine) and aromatic (tyrosine) amino acids, and of aspartic acid (Table 6S in Additional data file 1; ANOVA p < 0.01). Therefore, in addition to the absence of established motifs and domains, POFs, on average, have physical characteristics that further distinguish them from PDFs.

Discussion
Assigning a role to proteins with unknown function is a major goal of current and future genomic research. Homology searches have traditionally been used to assign specific domain structures to proteins, thereby classifying them into protein families with putative function [5,6]. The distinction made in this paper between proteins with defined motifs or domains (PDFs), and those with undefined or obscure features (POFs) (Figure 1), highlighted proteins that could not be assigned any known function by homology searches. Interestingly, POFs as a group were found to be shorter, more hydrophilic and more disordered than PDFs (Figure 5).
Our analysis of 10 different proteomes (Figures 1 to 3; Figures 1S to 3S in Additional data file 1) revealed a striking difference in conservation between PDFs and POFs. E-value plots show that this difference is evident whether similarities are measured with stringent or non-stringent criteria. With a minimum E-value similarity threshold of 10^-6, a total of 44,236 phylogenetically specific POFs were identified (Figure 3a). In contrast, relatively few (17,544) PDFs were identified as phylogenetically specific. The opposite trend was observed for conserved proteins. Only 3 POF groups were conserved in all 10 proteomes, in contrast to 1,650 PDF groups (Tables 1S and 2S in Additional data file 1). In total, 60% of POFs appear to be phylogenetically restricted, in contrast to only 7.5% of PDFs.
One explanation as to why POFs show a higher degree of species specificity than PDFs is that POFs, in contrast to PDFs, could include a disproportionately higher number of proteins incorrectly predicted from pseudogenes or incorrect gene models. This could result in an artifact in which random sequences or non-functional proteins distort the overall diversity of this group of proteins. However, several lines of evidence presented here (Figures 2, 4 and 5) suggest that this trivial explanation is not the case. For example, even in the relatively well-characterized yeast genomes, the pattern of higher sequence diversification among the POFs is consistent with that seen in the larger more complex genomes. Furthermore, as a group, POFs appear similar to PDFs with respect to mRNA expression, phenotypic penetrance, and involvement in protein-protein interactions.
An additional and noteworthy characteristic supporting the contention that most POFs contribute functional activities is that, on average, POFs show a greater degree of predicted disorder than PDFs (Figure 5). Empirical definitions of disordered structures have been derived from examining regions of proteins that fail to show a consistent or defined structure in a crystallized protein. These regions of disorder show a strong correlation with biochemical studies that suggest their involvement in protein-protein interactions, as well as in providing key regions for regulating a protein's activity via a structural conformation switch [14][15][16][17][18]. Importantly, the disorder prediction software programs used here do not provide high scores to random 'junk' DNA sequences, providing another line of evidence that gene models encoding POFs were not just artifactual predictions derived from 'junk' DNA. Rather, the high levels of disordered structures in POFs support their potential roles in regulatory networks in which protein conformational changes or protein-protein interactions are key.

Figure 5. Distinct differences in several biophysical characteristics between POFs and PDFs.

Together, the above arguments strongly support the view that the average POF is just as likely to have a biological function as a protein with a defined motif or domain. We favor, therefore, a genetic explanation for the unusual diversity within the group of POFs. That explanation is that genes encoding the majority of POFs are arising de novo or diverging at an evolutionary rate much higher than genes encoding PDFs. In support of this possibility, POFs are consistently more divergent among different proteomes (Figure 3b), and are preferentially represented as singletons in the different genomes (Figure 4S in Additional data file 1).
There may be several distinct mechanisms contributing to the genetic diversity of POFs. For example, some POFs may have structures that are highly flexible and can diverge with few structural constraints. By contrast, some POFs may have conserved tertiary structures, but are nevertheless showing rapid divergence in their primary sequence. In either case, a widely conserved motif or distinct domain signature may never be found within the primary sequences of a large subset of the currently defined POFs. Nevertheless, some POFs may ultimately be found to have definable features. One reason that these features currently remain undefined may be related to the sociology of science. In general, scientists have focused their molecular research on relatively few organisms, and devoted most of their resources to in-depth studies of relatively few proteins. Those proteins or pathways are often chosen because of their general relevance to fundamental questions in a broad group of organisms or because these proteins exhibit strong evolutionary conservation and are judged on this basis to have greater intrinsic functional relevance. By contrast, the study of a species-specific protein is often a lonely pursuit. Another source of bias lies in classical biochemical methods, which strongly favor the production and characterization of folded, active proteins that have highly ordered structures (for example, PDFs) and for which structural information is more readily obtained [19]. In contrast, disordered proteins (for example, POFs) are less well studied because they lack a readily recognized activity and structural information is more difficult to obtain for these proteins.
Previous work identified a class of proteins termed 'ORFans' that have no significant sequence similarity to any other open reading frame (ORF) and are, therefore, unique to a specific organism [20][21][22]. In contrast to the definition of POFs that is based on the presence of an observed domain or motif, the definition of an ORFan is based strictly on sequence homology. Thus, ORFans could include POFs as well as PDFs. Indeed, as shown in Figure 3a, POFs and PDFs accounted for 70.4% (42,218) and 29.6% (17,544) of all proteins unique among the 10 analyzed proteomes, respectively. Moreover, the majority of POFs from Mm and Rn were found to be similar (Figure 3a), suggesting that although some overlap exists between ORFans and POFs, homologs of many POFs can be found in similar genomes (Figure 3a). An interesting observation that was recently made for ORFans could also hold true for POFs. It was observed that some ORFans, although demonstrating no sequence homology to any known protein, could fold into a three-dimensional structure that resembled a protein with a known function [22]. In addition to being novel genes unique to an organism or a lineage, some POFs or ORFans could, therefore, be the result of convergent evolution. Thus, they might be distant members of known proteins, with similar functions and three-dimensional structure, but with sequences that have diverged beyond recognition.

Conclusion
The advent of genome sequences has reinvigorated an effort to understand the origins of species specificity. This is a daunting challenge, emphasized by the fact that in the 10 proteomes analyzed here we identified 44,236 phylogenetically specific proteins with undefined or obscure features (POFs). In contrast to PDFs, which have established domains or motifs that can be used to formulate working hypotheses about a protein's function, advancing our understanding of POFs must proceed without such clues. Our analysis here provides an expectation that, on average, 60% of a eukaryote's set of POFs will be highly divergent, and that functional studies will ultimately need to be conducted on a species-specific basis. For example, the human genome encodes 27 proteins that currently cannot be found in genome sequences of any model organism, including the chimpanzee sequence (Table 5S in Additional data file 1). Consistent with expectations from this study, these human-specific proteins are all POFs. Eventually, the function of these unique proteins will need to be studied in humans. Our results support a general expectation that understanding the unique biology of a given organism will ultimately involve understanding the functions of an unexpectedly large number of proteins that have no defined motifs or domains, are likely to have significant regions of disordered structure, and are restricted to a single species or a closely related phylogenetic branch.

To standardize the classification for which proteins are POFs and which are PDFs, we applied a consistent analysis method to all genes regardless of their current annotation. This analysis method involved an HMMPFAM [26] search against several major signature databases: PFAM [27], TIGRFAM [28], SMART [29], and Superfamily [30]. A protein sequence with a match to one or more of the models in any one of these databases, including domains of unknown function, was flagged as a PDF. Sequences with no matches to any of the models in any database were flagged as POFs. The definition of POFs used in our work was similar to that used in [9,31].
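The classification rule described above can be sketched as follows. This is a minimal illustration that assumes the search results have already been reduced to, for each signature database, the set of protein identifiers with at least one model match. The function name, the data layout and the example identifiers are hypothetical; parsing of the actual search output is not shown.

```python
from typing import Dict, Iterable, Set

# The four signature databases named in the text.
SIGNATURE_DATABASES = ("PFAM", "TIGRFAM", "SMART", "Superfamily")


def classify_proteins(protein_ids: Iterable[str],
                      hits_by_database: Dict[str, Set[str]]) -> Dict[str, str]:
    """Label each protein 'PDF' if it matches a model in any database
    (domains of unknown function included), otherwise 'POF'."""
    matched: Set[str] = set()
    for database in SIGNATURE_DATABASES:
        matched |= set(hits_by_database.get(database, ()))
    return {pid: ("PDF" if pid in matched else "POF") for pid in protein_ids}


# Hypothetical example: protein identifiers and per-database matches.
example_hits = {
    "PFAM": {"prot1", "prot2"},
    "SMART": {"prot2"},
    "Superfamily": {"prot4"},
}
labels = classify_proteins(["prot1", "prot2", "prot3", "prot4"], example_hits)
```

Because a single match in any database is sufficient, the POF set is the complement of the union of all matched sets, which keeps the rule independent of any pre-existing annotation.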

BLAST comparisons
BLAST comparisons of PDFs and POFs among different proteomes were performed using TeraBLAST running on an accelerated DeCypher server [32]. The comparisons of PDFs and POFs between each proteome and its respective collection of ESTs or between PDFs and POFs from each proteome and all other genomes translated in all reading frames were accomplished using TBLASTn [33]. ESTs were obtained on 5 December, 2004 from NCBI [23] except for Arabidopsis, which was downloaded at the same time from TAIR [24]. To examine the representation of PDFs and POFs from Sc or Ce in existing phenotypic studies, or existing protein-protein interaction datasets, POFs and PDFs, obtained as described above, were matched to existing datasets [9,10,12,13,34].

Prediction of protein properties
Prediction of relative disorder for PDFs and POFs was performed with the DisEMBL 1.4 prediction program [35]. Due to the large numbers of proteins that were analyzed, we used DisEMBL locally rather than at the website [36]. To obtain an overall value for the percentage of proteins that were disordered, SAS V9.1 (SAS Institute Inc., Cary, NC, USA) was used to sum the total regions that were predicted to be disordered and to divide it by the length for each protein analyzed. Hydrophilic index and amino acid content were calculated with the ad hoc perl script hydrophil.pl, developed by Garay-Arroyo et al. [37] using the Kyte-Doolittle values for hydrophilicity. SAS was used to perform statistical analysis (descriptive statistics and ANOVA) for the hydrophilic index of POFs and PDFs from hydrophil.pl results (Figure 5c), for variations in amino acid content between POFs and PDFs (Table 6S in Additional data file 1), and for sequence length and relative disorder of POFs and PDFs (Figure 5a, b). Because the average length of the POFs was shorter than that of the PDFs, a length correction was used to eliminate bias in the scoring function of the program.
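The two per-protein quantities described above can be sketched in a few lines. The sketch assumes that predicted disordered regions are available as 1-based inclusive (start, end) pairs and that the hydrophilic index is the mean Kyte-Doolittle hydropathy over the sequence, where lower values indicate a more hydrophilic protein. The exact scoring used by hydrophil.pl and the length correction mentioned above are not reproduced here, and the function names are hypothetical.

```python
from typing import Iterable, Tuple

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def percent_disordered(length: int,
                       regions: Iterable[Tuple[int, int]]) -> float:
    """Sum the residues covered by predicted disordered regions and divide
    by the protein length. Overlapping regions are counted only once."""
    if length <= 0:
        raise ValueError("protein length must be positive")
    covered = 0
    last_end = 0
    for start, end in sorted(regions):
        start = max(start, last_end + 1)  # skip residues already counted
        end = min(end, length)            # clip to the protein length
        if end >= start:
            covered += end - start + 1
            last_end = end
    return 100.0 * covered / length


def mean_hydropathy(sequence: str) -> float:
    """Mean Kyte-Doolittle value; non-standard residues are ignored."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()
              if aa in KYTE_DOOLITTLE]
    if not values:
        raise ValueError("sequence contains no standard residues")
    return sum(values) / len(values)


# Hypothetical protein of 100 residues with two overlapping predictions.
disorder = percent_disordered(100, [(1, 10), (5, 20)])
hydropathy = mean_hydropathy("MKRSPQ")
```

Per-protein values of this kind could then be compared between the POF and PDF groups with an analysis of variance, as was done in SAS.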

'All-against-all' comparisons and tree generation
'All-against-all' comparisons used to generate sets of species-specific proteins for both PDFs and POFs from Sc, Sp, At, Os, Dm, Ag, Ce, Mm, Rn and Hs were performed using TeraBLAST running on an accelerated DeCypher server [32], with a cutoff threshold of 10^-6. A tree showing the relationships among Sc, Sp, At, Os, Dm, Ag, Ce, Mm, Rn and Hs proteomes was constructed using the reciprocal percentage of the number of genes that the organisms have in common. This tree was constructed using the SAS cluster procedure utilizing the average linkage method, and graphed using the SAS tree procedure [38]. Tree diagrams are discussed in the context of cluster analysis by Hartigan [39], and Everitt [40].
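The average linkage step can be illustrated with a small pure-Python sketch rather than the SAS procedures used in the study. It assumes that the distance between two proteomes is 100 minus the percentage of genes they share, which is only one possible reading of 'reciprocal percentage'. The function name and the shared-gene percentages below are hypothetical and are not results from this paper.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Sequence, Tuple

Cluster = Tuple[str, ...]


def average_linkage(labels: Sequence[str],
                    distance: Dict[FrozenSet[str], float]
                    ) -> List[Tuple[Cluster, Cluster, float]]:
    """Agglomerative clustering with average linkage. Returns the merges
    in order, each as (cluster_a, cluster_b, merge_height)."""
    clusters: List[Cluster] = [(label,) for label in labels]
    merges: List[Tuple[Cluster, Cluster, float]] = []

    def between(c1: Cluster, c2: Cluster) -> float:
        # Average of all pairwise distances between members of c1 and c2.
        pairs = [distance[frozenset((a, b))] for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: between(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j],
                       between(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical percentages of shared genes between pairs of proteomes.
shared_percent = {
    frozenset(("Mm", "Rn")): 90, frozenset(("Hs", "Mm")): 80,
    frozenset(("Hs", "Rn")): 78, frozenset(("Hs", "Sc")): 20,
    frozenset(("Mm", "Sc")): 18, frozenset(("Rn", "Sc")): 19,
}
# Assumed conversion from shared percentage to distance.
distances = {pair: 100 - pct for pair, pct in shared_percent.items()}
merges = average_linkage(["Hs", "Mm", "Rn", "Sc"], distances)
```

With these illustrative numbers, mouse and rat join first, human joins them next, and yeast joins last, which is the order a tree plot of the merge heights would display.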

Additional data files
The following additional data are available with the online version of this paper. Additional data file 1 contains supplemental figures and tables. Supplemental Figures 1-1 through 1-10 show the relative similarity among PDFs and POFs in all proteomes studied. Supplemental Figure 2 shows the relative similarity among PDFs and POFs in selected proteomes measured as percentage identity or percentage similarity. Supplemental Figure 3 shows the relative similarity among PDFs and POFs between Hs and Pt, compared to Hs and Mm. Supplemental Figure 4 shows cluster analysis of POFs and PDFs in selected proteomes. Supplemental Table 1 lists common POFs to all proteomes analyzed. Supplemental Table 2 lists common PDFs to all proteomes analyzed. Supplemental Table 3 lists unique PDFs from all proteomes analyzed. Supplemental Table 4 lists unique POFs from all proteomes analyzed. Supplemental Table 5 lists 27 unique Hs proteins with representation in EST databases. Supplemental Table 6 describes the amino acid content of POFs and PDFs from the different proteomes studied.