Proteome-wide evidence for enhanced positive Darwinian selection within intrinsically disordered regions in proteins
© Nilsson et al.; licensee BioMed Central Ltd. 2011
Received: 26 January 2011
Accepted: 19 July 2011
Published: 19 July 2011
Understanding the adaptive changes that alter the function of proteins during evolution is an important question for biology and medicine. The increasing number of completely sequenced genomes from closely related organisms, as well as individuals within species, facilitates systematic detection of recent selection events by means of comparative genomics.
We have used genome-wide strain-specific single nucleotide polymorphism data from 64 strains of budding yeast (Saccharomyces cerevisiae or Saccharomyces paradoxus) to determine whether adaptive positive selection is correlated with protein regions showing propensity for different classes of structural conformation. Data from phylogenetic and population genetic analysis of 3,746 gene alignments consistently show a significantly higher degree of positive Darwinian selection in intrinsically disordered regions of proteins compared to regions of alpha helix, beta sheet or tertiary structure. Evidence of positive selection is significantly enriched in classes of proteins whose functions and molecular mechanisms can be coupled to adaptive processes, and these classes tend to have a higher average content of intrinsically unstructured protein regions.
We suggest that intrinsically disordered protein regions may be important for the production and maintenance of genetic variation with adaptive potential and that they may thus be of central significance for the evolvability of the organism or cell in which they occur.
Understanding the process of adaptation is of central importance for many biological questions, such as how species respond to climate changes, pathogens or other environmental perturbations, as well as for the mechanisms underlying genetic diseases, such as cancer. Evolutionary adaptation occurs when an inheritable change in the phenotype of an organism makes it more suited to its present environment. In diseases like cancer, adaptive mutations allow individual cells within multi-cellular organisms to thrive at the expense of neighbouring cells by over-riding the normal cellular controls that restrict cell growth and division. At the molecular level such phenotypic changes are the result of mutational processes acting on either protein-coding or non-coding DNA sequences. Although the neutral theory of evolution predicts the vast majority of mutations to be either deleterious or neutral, recent years have seen a sharp increase in publications identifying the action of positive Darwinian selection on genes in various species. The rapidly increasing number of completely sequenced genomes, along with improved bioinformatic methodologies for detecting evidence of selection [3–5], has enabled large-scale scanning of genes or genetic elements for evidence of positive selection. In particular, comparative approaches using sets of genomes from closely related species, or strains within a species, have proven powerful in detecting genes or genetic regions under recent positive selection [6–8].
Single nucleotide polymorphisms (SNPs) are the most abundant source of genetic variation affecting populations. SNPs found within a protein-coding region may be classified as synonymous SNPs or non-synonymous SNPs, depending on whether the encoded amino acid is altered in the alternative DNA sequence variants. Non-synonymous SNPs in coding sequences, together with SNPs in gene regulatory regions, are believed to have the highest impact on phenotype and hence they are suitable targets for studies on adaptation. However, a major task is still to understand which of the 10 million or so SNPs in the human genome are of functional significance. There is therefore a need for approaches that help to predict the subclass of SNPs that are more likely to be of adaptive significance. The relevance of this task is underscored by the International HapMap Project, which uses genetic variation as a tool to better understand the molecular basis of human disease as well as the mechanisms underlying pharmaceutical therapy.
Evolvability is often described as an organism's capacity to generate heritable phenotypic variation [11–13]. This capacity may either entail a reduction in the potential lethality of mutations or a reduction in the number of mutations required to generate phenotypically novel traits [14–17]. At the molecular level, non-synonymous SNPs in a protein-coding gene may result in structural changes in the encoded protein, which may cause phenotypic changes and an increased potential for evolutionary innovation, either directly or in future environments. Proteins consist of conformationally structured regions, containing α-helices and β-sheets, as well as intrinsically disordered regions that are conformationally flexible. Intrinsically disordered protein regions (IDRs) have been a recent focus of attention [18–21]. IDRs are abundant in the eukaryotic proteome, with an estimated 50 to 60% of all Saccharomyces cerevisiae proteins containing at least one disordered segment comprising more than 30 amino acid residues. Interestingly, IDRs occur more frequently in eukaryotes than in bacteria or archaea, perhaps suggesting a role in the evolution of eukaryotes. To our knowledge, the relationship between recent adaptation and the different types of structural domains within proteins has not been systematically studied.
The budding yeast S. cerevisiae is one of the best-studied model organisms at the molecular level. Its genome was the first eukaryotic genome to be fully sequenced, and it has a well-annotated proteome. The relatively small sizes of fungal genomes, along with recent advances in whole genome sequencing, have facilitated the establishment of multiple yeast genome sequences [26–29]. From an evolutionary perspective, the short generation time of yeasts combined with the strong environmental selective pressures to which they are exposed facilitate the detection of recent selection events in these organisms. Indeed, different budding yeast species display a surprisingly high level of genome diversity that is comparable to that observed within the family of chordates. The Saccharomyces Genome Resequencing Project has resulted in genomic sequences of multiple strains of S. cerevisiae and its close relative, Saccharomyces paradoxus. Studying polymorphism and divergence between the genomes of S. cerevisiae and S. paradoxus strains thus provides an excellent opportunity to identify genes or genetic regions likely to be under positive Darwinian selection.
In this study, we performed genome-wide analyses of SNPs identified in the Saccharomyces Genome Resequencing Project that lie within protein coding genes and used phylogenetic and population genetic methods to detect evidence of selection acting either on entire protein-coding genes or on individual codon sites within genes. Interestingly, we found a stronger association of both genes and codons under positive selection with intrinsically disordered protein regions compared to regions of regular secondary or tertiary structure. Furthermore, a higher degree of positive selection was found to act on proteins belonging to different functional and structural protein categories that are characterized by a high average IDR content. The biological significance of these findings is discussed in the context of the structure, function and evolvability of proteins.
The frequency of codon sites under positive selection is enhanced in protein regions with intrinsically disordered structure
To independently test whether the observed frequency differences were greater than would be expected by chance, a randomization test was performed. Briefly, the test entailed sampling a number of selected sites, equivalent to the number of sites found for each of the three conformational states individually, from the combined set of selected sites. The number of sites under either positive or negative selection in each such sample was then calculated. The procedure was repeated 10,000 times to obtain an empirical distribution of the number of selected sites expected by chance. The null hypothesis that the actual number of sites under selection for each conformational state belonged to the derived distributions of selected sites was assessed by a t-test. The results showed a significant (P ≤ 0.001) difference between the observed frequencies of selected sites in different conformational states and the empirically generated random distributions in all cases except in the case of negatively selected sites in α-helical regions. Figure 2d (left panel) shows the derived distributions from each randomization test along with the observed number of positively and negatively selected sites (downward-pointing arrowheads) for IDRs. The figure provides independent support for a strong enrichment of positively selected sites in IDRs and a small but significant depletion of negatively selected sites in these regions. The relative difference between the number of observed (downward-pointing arrowheads) and expected (upward-pointing arrowheads) sites under selection was much greater for positively than for negatively selected sites, as shown by the ratio of the two values (top right corner in each panel). The enrichment level for positively selected sites in IDRs is almost ten-fold higher than the under-representation level of negatively selected sites in the same regions. Hence, the distribution was considerably less skewed for negatively selected sites. 
The trend was exactly the opposite for regions with α-helical (right panels) and β-sheet (middle panels) conformation. Positively selected sites are under-represented in these regions. Again the extent of positive site under-representation is much greater than the deviation level for negative sites, which differ little, if at all, from the empirically generated value expected for a random distribution within the α-helical and β-sheet conformational classes. Based on the proteome-wide analysis of codons under selection, we thus concluded that there is a strong bias in the distribution of positively selected sites between gene regions encoding regular and disordered protein structure.
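The randomization procedure described above can be illustrated with a minimal sketch. It assumes that every selected site is labelled simply as 'pos' or 'neg' and pooled across the three conformational states; the function name, the labels and the seed are hypothetical. For brevity, the sketch reports an empirical two-sided P-value taken directly from the resampled counts, which is a simplification and not the t-test used in the analysis.

```python
import random

def randomization_test(site_labels, n_state, observed_pos, n_iter=10_000, seed=0):
    """Resample selected sites to estimate how many positive sites a state
    would contain by chance.

    site_labels  : list of 'pos' / 'neg' labels for all selected sites,
                   pooled over the three conformational states
    n_state      : number of selected sites found in the state of interest
    observed_pos : number of positively selected sites observed in that state
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_iter):
        # Draw as many sites as the state contains, without replacement.
        sample = rng.sample(site_labels, n_state)
        counts.append(sample.count("pos"))
    expected = sum(counts) / n_iter
    ratio = observed_pos / expected if expected else float("inf")
    # Empirical two-sided P-value (an assumption of this sketch).
    p_high = sum(c >= observed_pos for c in counts) / n_iter
    p_low = sum(c <= observed_pos for c in counts) / n_iter
    return expected, ratio, min(1.0, 2 * min(p_high, p_low))
```

The returned ratio corresponds to the observed/expected value shown in the top right corner of each panel of Figure 2d.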
We next investigated whether a similar bias in the distribution of codons under selection could be observed at the level of intact genes. To this end, a non-overlapping sliding window of 25 codons was moved across each aligned gene in the analyzed data set, and the number of positively selected codon sites within each window was counted. The predicted IDR content within each window was also calculated. Each window containing at least one positive site thus generated a data point and for genes resulting in at least five such data points the correlation between IDR content and the number of codons under positive selection was assessed by calculation of Spearman's rank correlation coefficient (P ≤ 0.05). Again, the correlation between degree of disorder and incidence of positive selection was obvious. For the genes analyzed, a significant positive correlation between IDR content and positively selected codon sites was observed in 528 genes, whereas a significant negative correlation was found in only 28 genes. These results thus suggest that the correlation between positively selected sites and gene regions encoding IDRs can be extended to the level of intact genes and proteins.
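As an illustration of the window-based analysis, the following sketch computes Spearman's rank correlation coefficient for a single gene. It assumes two hypothetical per-codon lists of 0/1 flags (positively selected or not, predicted disordered or not). Only the coefficient is computed here; the significance test at P ≤ 0.05 is not reproduced.

```python
def windows(values, size=25):
    """Split a per-codon list into consecutive non-overlapping windows."""
    return [values[i:i + size] for i in range(0, len(values), size)]

def average_ranks(xs):
    """1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Pearson correlation of the ranks; None if either variable is constant."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return None
    return cov / (vx * vy) ** 0.5

def gene_correlation(is_positive, is_disordered, size=25, min_points=5):
    """Correlate IDR content with the number of positive sites per window.

    Windows without any positive site generate no data point, and genes
    with fewer than `min_points` data points are skipped (None).
    """
    points = []
    for pos_w, idr_w in zip(windows(is_positive, size), windows(is_disordered, size)):
        n_pos = sum(pos_w)
        if n_pos >= 1:
            points.append((sum(idr_w) / len(idr_w), n_pos))
    if len(points) < min_points:
        return None
    return spearman_rho([p[0] for p in points], [p[1] for p in points])
```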
Intrinsically disordered protein regions have a higher proportion of fixed non-synonymous polymorphisms
A Mann-Whitney U test was also performed in order to independently test the significance of the correlation between FI values and IDR content. Genes were sorted into two equally sized groups according to the level of their FI value (the median FI value was 0.42 after removal of outliers). The null hypothesis of equal secondary structure content in the resulting data sets was then tested. There was a significantly higher IDR content in the dataset containing higher FI values (P ≤ 10⁻¹⁵). No significant difference in FI or IDR content (P > 0.5) was found between subsets when the dataset was divided in the same way into subsets of high and low (G+C) content (the median G+C value was 0.42). Thus, we conclude that there is a higher proportion of fixed non-synonymous polymorphism in IDRs than in other protein regions, again suggesting an enhanced level of positive selection in these regions.
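The median split and group comparison can be sketched as follows. The input is assumed to be a hypothetical list of (FI, IDR content) pairs, one per gene. The P-value uses the normal approximation without a tie correction, which is a simplification chosen for this sketch and may differ from the implementation used in the analysis.

```python
import math

def median_split(genes):
    """Split genes into equally sized low-FI and high-FI groups.

    genes: list of (fi, idr_content) tuples. With an odd number of genes
    the middle gene is dropped so that both groups have the same size.
    Returns the IDR contents of the low-FI and high-FI groups.
    """
    ordered = sorted(genes, key=lambda g: g[0])
    half = len(ordered) // 2
    low = [idr for _, idr in ordered[:half]]
    high = [idr for _, idr in ordered[len(ordered) - half:]]
    return low, high

def mann_whitney_u(xs, ys):
    """U statistic for xs versus ys and a two-sided P-value.

    The P-value relies on the normal approximation with no tie
    correction (an assumption of this sketch).
    """
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)
    n, m = len(xs), len(ys)
    mean = n * m / 2
    sd = math.sqrt(n * m * (n + m + 1) / 12)
    z = (u - mean) / sd
    return u, math.erfc(abs(z) / math.sqrt(2))
```

Calling `mann_whitney_u(high, low)` on the two groups returned by `median_split` tests whether IDR content differs between genes with high and low FI values.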
A potential problem with the analyses presented above is the fact that most genes did not obtain a statistically significant FI value at the chosen level of significance, and hence were discarded from the analysis. To assure that this did not prejudice the overall conclusion, we performed an alternative, proteome-wide analysis. Three composite alignments were created by concatenating protein regions from all 3,746 aligned genes that are predicted to be α-helix, β-strand or IDR. The overall FI was then calculated for each of the three concatenated alignments. Figure 3e shows the resulting overall FI for each composite alignment. In accordance with our previous observations, the overall FI value was close to 1.0 in the IDRs, indicating an overall balance between positive and negative selection acting within these regions. These results were very similar whether a strict or a liberal confidence value was used in the IDR predictions (see Materials and methods). In protein regions with regular secondary structure, the overall FI value was lower than 1.0, indicating an overall bias towards purifying selection acting on these regions. Thus, the data support enhanced positive selection in IDRs even when data from all the gene alignments are studied.
Finally, as an independent assessment of the distribution bias of positively selected polymorphic sites within genes, a non-overlapping window of 25 codons was moved over all the gene alignments, and a regional FI was calculated within each such window. The correlation between the resulting FI and IDR content was estimated by Spearman's rank correlation coefficient. The number of genes with a positive correlation between intrinsic disorder and FI (329 genes) was about an order of magnitude higher than the number of genes where a negative correlation was observed (39 genes), again suggesting a positive correlation between intrinsic disorder and degree of positive selection within proteins.
Intrinsically disordered regions are not depleted in functional sites
Positively selected sites are over-represented in a subset of functional protein categories
Protein categories with a high propensity for positive selection have a high average IDR content
Here we show evidence for association between positive adaptive selection and regions of proteins with a low intrinsic propensity for secondary structure formation. This conclusion is based on the study of how genetic variation within 64 strains of S. cerevisiae and S. paradoxus affects the amino acid sequence of about two-thirds of the proteins within the yeast proteome. Since we cannot reconstruct the evolutionary history of these strains, it is relevant to discuss issues that influence the robustness of our conclusions.
Firstly, we have addressed whether the conclusions we draw could be influenced by the selection of gene alignments for study since we have not studied all genes. Genes were mainly excluded from the study based on uncertainty of the alignments. For the analysis shown, we required a level of 70% amino acid identity in proteins translated from the aligned genes. Reducing this threshold to 60% did not increase the number of proteins appreciably, probably because many of the low quality alignments result from incomplete genome sequences for one or more of the strains. An increase of the threshold to 80% identity, however, led to the exclusion of a further 800 gene alignments. Importantly, the use of these different thresholds for selection of gene alignments for study did not significantly influence the conclusions drawn.
Secondly, we have used different approaches to identify evidence of natural selection since each individual method may be subject to potential drawbacks. While the accuracy of maximum likelihood methods for identifying codons under selection has been questioned recently [35, 36], the McDonald-Kreitman approach is an insensitive method for detecting positive selection because evidence of positive selection is often cancelled out by negative selection, which is much more common. Indeed, the recent study by Liti et al. did not find any statistical support for the existence of individual genes under positive selection when McDonald-Kreitman data were corrected for random effects associated with multiple testing. We have not corrected the data in our analysis since the aim was to study the overall association of protein structure with propensity for positive or negative selection rather than to identify individual genes under selection. The fact that we identify evidence for similar patterns of positive and negative selection at the level of codons using the FEL method and at the level of intact genes or gene regions using the McDonald-Kreitman test strongly supports the conclusion that the propensity for positive selection is enhanced in the IDRs of proteins. Nozawa et al. have pointed to the utility of correlating bioinformatic predictions of codon sites under positive selection with biochemical data. Our observation that predicted evidence of positive selection tends to correlate with IDRs in proteins will be a useful parameter to test in other systems.
Thirdly, we have used several alternative strategies and statistical tests, including permutation tests of empirical significance levels, to assess the significance of the associations we have observed in the different tests for positive and negative selection. In all cases these tests provide statistical support for the association between positive selection and IDRs in proteins.
Fourthly, we have used alternative approaches to study the possibility that the increased frequency of positively selected residues in IDRs is the result of reduced negative selection in these regions due to the fact that they might be less important for protein function. This hypothesis would fit well with preconceptions about protein structure that have stated that structured conformation correlates with functional significance. Most importantly, however, this explanation of our results is contradicted by our data, since the frequency of negatively selected codons is not significantly reduced in IDRs relative to protein regions with a structured conformation. Consistently, other recent reports also suggest that IDRs are under negative selection at a level that may even exceed the level for secondary structure elements [31, 32]. To further address the issue, we used the Limacs method to independently predict the relative frequency of functionally important amino acid residues in IDRs in relation to regions of structured conformation. The results are consistent with our codon selection data and show that the predicted frequency of functionally important residues is similar in IDRs and regions with structured conformation.
Other approaches to assess the robustness of our conclusion that the IDRs in proteins are particularly susceptible to positive selection are to test whether protein classes predicted to have high adaptive potential make biological sense, whether they are generally characterized by proteins with high IDR content, as well as whether such proteins are associated with molecular mechanisms that could explain their higher adaptive potential. To test whether particular protein categories are enriched in proteins predicted to be under positive or negative selection, we used the FunCat and ProteinCat catalogues of yeast proteins. Several FunCat categories showed significant over-representation of proteins predicted to be under positive selection. These included functions associated with development, mating, and morphogenesis that contain known targets for adaptive selection. Other categories have to do with virulence, defense and cell signaling as well as many categories related to DNA functions, including transcription. Most of these categories contain many proteins that are potentially relevant targets for adaptive mutation. Fewer FunCat categories showed significant evidence of negative selection but these include classes containing highly conserved proteins involved in detoxification, fermentation and protein folding. The ProteinCat categories that are significantly enriched in proteins predicted to be under negative or positive selection also make sense. Most of the categories under negative selection contain enzymes, which are known to be high in structured regions under negative selection. The three categories enriched in proteins predicted to be under positive selection are all categories containing transcription factors. Individual transcription factors have been suggested to be under positive selection previously [37–39]. Furthermore, transcription factors have been shown to evolve faster than other protein classes in yeast . 
As predicted by our model, FunCat and ProteinCat categories that are over-represented in proteins predicted to be under positive selection also have a high average IDR content while the reverse is true for categories associated with negative selection.
Our data suggest that the conformational flexibility of IDRs, which might potentially translate to a functional flexibility, could represent a generally evolvable characteristic. IDRs might represent a conformational ground state that provides proteins with an intrinsic ability to adopt new functionality. According to this view, protein regions with regular structure would tend to favor structural and functional specialization but at a cost in terms of evolvability. Consistent with this, experimental studies suggest that naturally occurring proteins are not maximally stable, but rather that they seem to exhibit the minimal level of stability necessary for the environment in which they function. Furthermore, in silico studies have shown that strong selection for structural stability would be expected to lead to reduced evolution of novel protein functions [42, 43].
A key question is thus whether there is evidence to support a link between the conformational flexibility of IDRs and functional flexibility. The widespread involvement of IDRs in interactions between protein partners involved in a diverse range of biological functions provides such evidence. Further, IDRs have, as predicted, been shown to adopt different conformations upon interaction with different binding partners. It has long been recognized that alterations affecting gene regulation provide a powerful opportunity for evolution of phenotypic differences between organisms and hence for the adaptation of organisms to new environments. Much attention has focused on studies of adaptive changes that affect the sequence of cis-acting regulatory elements in gene promoters, enhancers or silencers [46, 47]. However, recent studies have suggested that mutations in trans-acting components, including transcription factors and co-regulators, are also important for the evolutionary adaptation of transcription networks [48–50]. Protein interactions involving transcriptional components have been suggested to play a role in such evolutionary processes [46, 51]. Several studies have shown evidence of adaptive changes in the protein interaction domains of transcription factor proteins [37–39] and previous computational studies have independently shown that transcription factors have a high IDR content [20, 52]. Interestingly, transcription factor activation domains have been shown to be IDRs [53, 54] that interact with other proteins by two-step target-assisted folding mechanisms in which their intrinsically unstructured nature plays an important role [55, 56].
Taken together with previous knowledge, our results thus provide strong evidence for the involvement of IDRs in evolutionary adaptation. Such IDRs are sometimes associated with transcription factors, where they have been relatively well studied, but it is likely that IDRs involved in adaptation may be found in a much broader range of protein classes. The genome-wide nature of the study suggests that the conclusions are significant to most if not all of the proteome. The adaptive nature of IDRs gives a new perspective for understanding the potential adaptive significance of gene variants that arise in nature and medicine.
Materials and methods
Retrieval of genomic sequences and polymorphism information
Plain text files containing the S. cerevisiae and S. paradoxus reference genome sequences, genomic coordinates of identified SNPs in each of the sequenced isolates (37 S. cerevisiae strains and 27 S. paradoxus strains), and genomic coordinates of the protein coding genes in S. cerevisiae were downloaded on 1 September 2007 from the Sanger ftp site . The strains studied are listed in Additional file 1. Only confirmed polymorphisms were used in the subsequent analyses. Details of synonymous and non-synonymous SNPs in S. cerevisiae and S. paradoxus strains are described per protein-coding sequence and per chromosome in Additional files 2, 3, 4, 5, 6 and 7.
Retrieval of protein coding genes in S. cerevisiae and S. paradoxus
The coding sequences of all annotated S. cerevisiae protein coding genes in the Saccharomyces Genome Database  were extracted from the reference genome sequence, and reverse complemented for genes where transcription occurs from the lower strand. For S. paradoxus, we retrieved the genomic coordinates of genes inferred previously based on synteny and sequence similarity of predicted ORFs in the S. paradoxus genome to annotated genes in the S. cerevisiae genome . The corresponding coding sequences were extracted from the S. paradoxus reference genome sequence and tested for their coding potential in six possible ORFs using the sixpack method of the EMBOSS sequence analysis package . When necessary, the coding sequence was reverse complemented and/or shifted to yield a translatable ORF. No mitochondrial genes were included in the analysis.
Alignment of orthologous genes
For each S. cerevisiae gene where a S. paradoxus orthologue could be inferred, a multiple sequence alignment was created consisting of all strain orthologues for which at least one sequence contained at least one SNP relative to the respective reference genome sequence. To assure that the alignment did not result in any frameshifts, translation-assisted alignments were created using DIALIGN-T . To ensure that subsequent analyses were not affected by the occurrence of uncertain alignments, orthologous protein coding gene alignments were filtered at different stringency thresholds based on the level of sequence identity in the alignments of the translated sequences and alignments below the filtering threshold were removed (60%, 4,029 alignments; 70%, 4,001 alignments; 80%, 3,198 alignments). Subsequent analyses were performed on each of the three datasets to determine whether the choice of filtering threshold altered the conclusions drawn from subsequent analyses. The choice of filtering criteria did not significantly influence subsequent analysis and results using the alignments with ≥70% are shown in the paper. Details of the DNA sequence alignments used in the study are available on request. The methods used to detect selection assume that all sites in each gene share the same phylogeny  and therefore alignments where recombination events were predicted by the GENECONV method  using the '/r' option (only silent sites analyzed), and calculating global P-values based on Bonferroni-corrected Karlin-Altschul P-values were removed from the subsequent analysis.
Prediction of structured and intrinsically disordered protein regions
The PSIPRED  and VSL2  methods were used to predict the occurrence of structured and disordered protein regions, respectively, in all S. cerevisiae proteins. The protein-coding DNA sequences used as well as the protein sequences translated from them are provided in Additional files 8 and 9, respectively. The VSL2 method was among the best performing in the CASP7 assessment of IDR prediction algorithms , and performs particularly well in predicting short disordered regions . Both methods rely on evolutionary information derived from PSI-BLAST generated profiles. For the PSI-BLAST searches we used a filtered version of the Uniref90 database (release 12.8) where transmembrane regions, coiled-coil regions and low-complexity regions had been removed using the pfilt program . Each PSI-BLAST search was performed for three iterations, with an E-value threshold of 0.001 for inclusion in a multi-pass model, against the reduced Uniref90 database. A position-specific scoring matrix was produced (option -Q) and this was used as input to the PSIPRED and VSL2 algorithms. Amino acid residues predicted by PSIPRED to belong to state 'helix' or 'extended beta' with a confidence value equal to or larger than 8 were chosen for subsequent analysis of regular secondary structures (Additional files 10 and 11). For prediction of disordered residues, we used both a strict confidence value threshold of 0.8 and a liberal threshold value of 0.5 (Additional files 12 and 13). Any residue sites for which predictions of disordered and regular structure overlapped were removed. Except where stated, the strict confidence value (0.8) was used for IDR prediction in the data shown in the paper. The mean fraction of residues reliably predicted to be in α-helical, β-strand, and intrinsically disordered conformation was 26%, 6% and 23%, respectively. 
Since the sequences studied using these selection criteria represent only 55% of amino acid residues, all analyses were also performed using the liberal confidence threshold (0.5) for IDR detection (44% of residues identified as IDR, 76% of residues included in analyses). None of the overall conclusions were affected by use of reduced-stringency IDR prediction criteria (0.5). We obtained 1,191 protein regions mapping to known structured domains in the Protein Data Bank (PDB) and corresponding to 643 yeast proteins from the PFAM database (version 25.0).
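The per-residue assignment of conformational states described above can be sketched as follows. The input format is hypothetical: one (state, confidence) pair per residue for PSIPRED, with states 'H', 'E' or 'C' and confidence 0 to 9, and one disorder score between 0 and 1 per residue for VSL2. Treating the disorder threshold as inclusive (≥) is an assumption of this sketch.

```python
def assign_states(psipred, vsl2, ss_conf=8, idr_threshold=0.8):
    """Label each residue as 'helix', 'strand', 'IDR' or None.

    psipred       : list of (state, confidence) pairs, one per residue
    vsl2          : list of disorder scores, one per residue
    ss_conf       : minimum PSIPRED confidence for regular structure
    idr_threshold : 0.8 for the strict setting, 0.5 for the liberal one
                    (inclusive comparison is an assumption)
    """
    labels = []
    for (state, conf), score in zip(psipred, vsl2):
        regular = state in ("H", "E") and conf >= ss_conf
        disordered = score >= idr_threshold
        if regular and disordered:
            # Overlapping predictions are removed from the analysis.
            labels.append(None)
        elif regular:
            labels.append("helix" if state == "H" else "strand")
        elif disordered:
            labels.append("IDR")
        else:
            # Not reliably predicted in any of the three states.
            labels.append(None)
    return labels
```

Re-running the same function with `idr_threshold=0.5` gives the liberal assignment used to check that the conclusions do not depend on the stringency of IDR prediction.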
Phylogenetic test for selection
Amino acid residues under selection in inter-strain/species alignments were identified using a codon-based maximum likelihood method implemented in the HyPhy software package. Each alignment was analyzed separately for codon sites under selection using the FEL method. A neighbor-joining tree was built for each alignment under the Tajima-Nei model of nucleotide substitution, and the tree topology along with the alignment were used as input. The HKY85 model of nucleotide substitution was used. In the FEL method, the rates of synonymous (α) and non-synonymous (β) substitution are estimated at each codon site, and a likelihood ratio test is used to assess whether the two rates differ. If the synonymous substitution rate is higher than the non-synonymous rate (α > β), this is indicative of negative selection, whereas a higher non-synonymous substitution rate (β > α) indicates the action of positive selection. The default threshold value of P ≤ 0.1 was used to reject the null-hypothesis of α = β at a codon site.
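The decision rule applied to each codon site can be written as a short sketch. It assumes that the per-site estimates of α and β and the likelihood ratio test P-value have already been obtained (for example from FEL output); the function name and return labels are hypothetical.

```python
def classify_site(alpha, beta, p_value, threshold=0.1):
    """Classify one codon site from its estimated rates and P-value.

    alpha   : estimated synonymous substitution rate
    beta    : estimated non-synonymous substitution rate
    p_value : P-value of the test of the null hypothesis alpha = beta
    """
    if p_value > threshold:
        return "neutral"   # alpha = beta cannot be rejected
    if beta > alpha:
        return "positive"  # excess of non-synonymous change
    if alpha > beta:
        return "negative"  # excess of synonymous change
    return "neutral"
```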
Population genetic test for selection
To detect genes under selection, the multiple sequence alignments of all orthologous protein coding genes were subjected to the McDonald-Kreitman test as implemented in the MKtest program of the libsequence package. This test investigates the correlation of polymorphisms within species and their divergence between species, and also distinguishes between synonymous and non-synonymous sites. In a sequence having no evolutionary constraints, the ratio of non-synonymous to synonymous sites that are fixed between species (dN/dS) should be roughly equal to the ratio of non-synonymous to synonymous sites that are polymorphic within a species (pN/pS), according to the neutral theory of evolution. We refer to the ratio (dN/dS)/(pN/pS) as the fixation index (FI). When negative selection is acting on a locus, non-synonymous mutations are unlikely to become fixed, although they might still contribute to polymorphism within a species. Thus, the ratio (dN/dS) is expected to be lower than the ratio (pN/pS), yielding an FI <1. However, if positive selection is acting on a locus, non-synonymous mutations are expected to spread rapidly through the population, thus having a greater effect on divergence than on polymorphism. In this case, the ratio (dN/dS) is expected to be higher than the ratio (pN/pS), yielding an FI >1. We calculated the FI for each gene and performed a Fisher's exact test of the null hypothesis of independence between the two ratios (dN/dS) and (pN/pS). Rejection of the null hypothesis at the 5% significance level was taken as an indication of either negative or positive selection, depending on the FI value (Additional file 14). The occurrence of slightly deleterious mutations is known to cause an underestimation of the level of adaptive evolution, and a frequently used approach to control for some of the effects of these mutations is to remove polymorphisms segregating at low frequencies.
Thus, all SNPs occurring in less than 15% of the strains within a species were removed, as this has been demonstrated to be an appropriate threshold . Additionally, average proteome McDonald-Kreitman tests were performed by merging all aligned codons encoding amino acid residues reliably predicted to be in regions of α-helical, β-sheet or intrinsically disordered conformation, and performing the calculations described above on each of the three resulting composite alignments (Additional file 15). The (G+C) content of each S. cerevisiae gene was calculated using the geecee program of the EMBOSS package.
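The calculation described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the MKtest program: the function names are hypothetical, the counts of fixed and polymorphic changes are assumed to be available already, and the two-sided Fisher's exact test is computed directly from the hypergeometric distribution. The example counts are those reported for the Adh locus in the original McDonald-Kreitman paper.

```python
from math import comb


def keep_snp(carrier_strains: int, total_strains: int,
             min_fraction: float = 0.15) -> bool:
    """Keep a SNP only if it occurs in at least min_fraction of the strains."""
    return carrier_strains / total_strains >= min_fraction


def fixation_index(dn: int, ds: int, pn: int, ps: int):
    """FI = (dN/dS)/(pN/pS); None when the ratio is undefined."""
    if ds == 0 or pn == 0 or ps == 0:
        return None
    return (dn / ds) / (pn / ps)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = prob(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def mk_call(dn: int, ds: int, pn: int, ps: int, alpha: float = 0.05):
    """Return (FI, p-value, call) for one gene."""
    fi = fixation_index(dn, ds, pn, ps)
    p = fisher_exact_two_sided(dn, ds, pn, ps)
    if fi is None or p > alpha:
        return fi, p, "neutral"
    return fi, p, "positive" if fi > 1 else "negative"


if __name__ == "__main__":
    # Adh counts: 7 fixed non-synonymous, 17 fixed synonymous,
    # 2 polymorphic non-synonymous, 42 polymorphic synonymous.
    print(mk_call(7, 17, 2, 42))
```

Genes for which FI cannot be computed (a zero in dS, pN or pS) are reported as neutral here; how such genes were handled in the original analysis is not stated, so this is an assumption of the sketch.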
Prediction and analysis of functionally important amino acid residues
We applied the Limacs method [76, 77] to predict the occurrence of functionally important sites in the amino acid sequences of translated S. cerevisiae genes. Given a multiple sequence alignment, Limacs uses a template library to predict functionally important sites in the alignment. Since the method is based on known functional sites in conserved functional domains, we constrained the analysis to mapped Pfam domains. All annotated Pfam domains were mapped to the translated yeast genes using reverse position-specific BLAST (RPS-BLAST). Domains that mapped to at least one of the yeast proteins were subjected to analysis by Limacs to predict the location of functional sites. To be scored as a functional site, a site was required to have a query column versus template pattern score (QT score) of at least 0.95, a QT Z-score of at least 1, and randomization scores QRn and TRn lower than 0.01. The distribution of predicted functional sites in mapped Pfam domains was analyzed for residues in regions of intrinsically disordered and regular conformation, and differences in the distribution were assessed by a χ² test, for which a 2 × 2 contingency table of functional sites in IDRs (LI) and non-IDRs (LnI), and of non-functional sites in IDRs (nLI) and in non-IDRs (nLnI) was built. The Limacs functional sites index (LI/nLI)/(LnI/nLnI) was used to indicate whether there was a relative abundance (index above one) or depletion (index below one) of predicted functional sites in IDRs compared to non-IDRs. Predicted functional sites in each gene are listed in Additional file 16. To ensure the robustness of the obtained results, the comparison was repeated using various IDR prediction reliability cutoff values. Furthermore, to avoid bias from over-represented domains, a conservative filtering procedure was also applied, in which only one of the mapped protein regions was analyzed in cases where a domain mapped to more than one protein.
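The index and the accompanying χ² test can be sketched as follows. The function names and the example counts are hypothetical, and the sketch assumes a Pearson χ² statistic with one degree of freedom and no continuity correction, which the text does not specify.

```python
import math


def functional_sites_index(li: int, nli: int, lni: int, nlni: int) -> float:
    """(LI/nLI)/(LnI/nLnI): above one means enrichment of sites in IDRs."""
    return (li / nli) / (lni / nlni)


def chi2_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-square test (no continuity correction) for [[a, b], [c, d]].

    Returns the statistic and its p-value for one degree of freedom.
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square variable with 1 df.
    p = math.erfc(math.sqrt(stat / 2.0))
    return stat, p


if __name__ == "__main__":
    # Hypothetical counts: rows are functional / non-functional sites,
    # columns are IDR / non-IDR residues.
    li, lni, nli, nlni = 10, 40, 990, 960
    print(functional_sites_index(li, nli, lni, nlni))
    print(chi2_2x2(li, lni, nli, nlni))
```

With these invented counts the index is below one, which would correspond to a depletion of predicted functional sites in IDRs.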
Assessment of differences in degree of selection between protein regions of regular and intrinsically disordered structure
Spearman's rank correlation coefficient ($r_S$) was calculated to assess the correlation between secondary structure content and FI in genes where the McDonald-Kreitman test rejected the null hypothesis of neutral evolution. In the same manner, the correlation was assessed between (G+C) content and secondary structure content, and between (G+C) content and FI. Because of the large sample size ($n$), a normal distribution was assumed and the statistical significance was determined by calculating $z = r_S\sqrt{n-1}$.
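As a worked illustration of the large-sample approximation above, the following Python sketch computes Spearman's coefficient from average ranks and converts it to a z-value and a two-sided P-value. The function names and the example numbers are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_z(r_s: float, n: int):
    """z = r_S * sqrt(n - 1) and its two-sided normal P-value."""
    z = r_s * math.sqrt(n - 1)
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


if __name__ == "__main__":
    # Hypothetical disorder content and FI values for four genes.
    print(spearman([0.1, 0.4, 0.2, 0.9], [1.0, 3.0, 2.0, 4.0]))
    # A weak correlation becomes significant once n is large.
    print(spearman_z(0.1, 401))
```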
The intra-genic correlation between secondary structure content and FI was assessed for each gene by a sliding-window analysis, where structure content and FI were calculated within non-overlapping windows of size 25 codons for all aligned orthologous genes. Spearman's rank correlation coefficient was calculated for the resulting data points, indicating the correlation of structure content and FI for each individual gene. Only informative windows were analyzed, that is, regions for which an FI value could be calculated. If the number of informative windows from a gene alignment was more than 30, a normal distribution was assumed and an approximate z-value was calculated as above. In cases where the number of informative windows was fewer than 30 but more than 5, significance was assessed by consulting a table of pre-calculated critical $r_S$ values. If the analysis produced five or fewer informative windows, no correlation analysis was performed on the gene.
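The windowing step can be sketched as follows. The data layout (one boolean and one tuple of counts per codon) and all function names are hypothetical. Two details that the text leaves open are filled by assumption: a trailing window shorter than 25 codons is dropped, and a gene with exactly 30 informative windows is sent to the table lookup.

```python
def non_overlapping_windows(n_codons: int, size: int = 25):
    """Return (start, end) index pairs of full windows; a short tail is dropped."""
    return [(s, s + size) for s in range(0, n_codons - size + 1, size)]


def window_points(disordered, counts, size: int = 25):
    """Return (disorder content, FI) for every informative window.

    disordered: per-codon booleans (True if predicted disordered).
    counts:     per-codon tuples (dn, ds, pn, ps) of fixed and polymorphic
                non-synonymous and synonymous changes.
    """
    points = []
    for start, end in non_overlapping_windows(len(disordered), size):
        dn, ds, pn, ps = (sum(c[i] for c in counts[start:end]) for i in range(4))
        if ds == 0 or pn == 0 or ps == 0:
            continue  # FI undefined: the window is not informative
        fi = (dn / ds) / (pn / ps)
        content = sum(disordered[start:end]) / size
        points.append((content, fi))
    return points


def choose_significance_method(n_windows: int) -> str:
    """Pick the significance procedure from the number of informative windows."""
    if n_windows > 30:
        return "normal_approximation"
    if n_windows > 5:
        return "critical_value_table"  # assumption: also used for exactly 30
    return "skip"


if __name__ == "__main__":
    disordered = [True] * 25 + [False] * 25
    counts = [(2, 1, 1, 2)] + [(0, 0, 0, 0)] * 49
    print(window_points(disordered, counts))
    print(choose_significance_method(12))
```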
The distribution of codon sites predicted to be under positive or negative selection was likewise assessed between the IDR, α-helix and β-sheet conformational states. A series of 2 × 2 contingency tables was generated describing the frequencies of two types of codon sites in two different structure states, and for each table the difference in distribution was assessed by a χ² test. To independently investigate whether an observed difference in the distribution of a given type of codon in a certain conformational state could be expected by chance, a randomization procedure was applied. The number of codon sites, n, of a given type C in the investigated structure state S was counted and summed over all genes, as was the total number of codon sites, N, in conformational state S. The randomization test entailed performing 10,000 re-samples, where N codon sites were randomly chosen from any gene, and counting the resulting number of sites, n', of type C. The observed number of type C sites, n, in intrinsically disordered regions was then compared to the simulated distribution of n' values, and the null hypothesis of equal distribution in disordered regions and regions with secondary structure was rejected by a two-tailed t-test (P ≤ 0.001).
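The re-sampling step can be sketched in Python as shown below. The function name and the toy data are hypothetical, and two points are assumptions of the sketch rather than details from the text: sites are drawn without replacement, and the comparison of n with the simulated n' values uses an empirical two-sided P-value in place of the two-tailed t-test.

```python
import random


def randomization_test(is_type_c, n_state_sites: int, observed: int,
                       resamples: int = 10_000, seed: int = 0):
    """Compare an observed count of type C sites with random expectation.

    is_type_c:     one boolean per codon site, pooled over all genes.
    n_state_sites: N, the number of codon sites in conformational state S.
    observed:      n, the number of type C sites observed in state S.
    Returns the simulated counts n' and an empirical two-sided P-value.
    """
    rng = random.Random(seed)
    simulated = [sum(rng.sample(is_type_c, n_state_sites))
                 for _ in range(resamples)]
    upper = sum(s >= observed for s in simulated)
    lower = sum(s <= observed for s in simulated)
    # Add one to the count so that the P-value is never exactly zero.
    p = min(1.0, 2 * (min(upper, lower) + 1) / (resamples + 1))
    return simulated, p


if __name__ == "__main__":
    # Toy proteome: 1,000 codon sites, 50 of them of type C.
    pool = [True] * 50 + [False] * 950
    # Suppose state S covers 100 sites, 20 of which are of type C.
    _, p_value = randomization_test(pool, 100, 20, resamples=2_000)
    print(p_value)
```

Fixing the seed makes the sketch reproducible; the random expectation in the toy example is five type C sites per 100 sampled sites.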
The intra-genic correlation between occurrence of type C sites and secondary structure content was assessed for each gene by a sliding-window analysis, where non-overlapping windows of size 25 codons were moved over the gene, counting the number of type C sites and calculating the secondary structure content in each window. Spearman's rank correlation coefficient was calculated to assess the correlation between secondary structure content and type C codon site distribution, and the statistical significance was estimated as described above, discarding windows with no type C codons.
Assessment of differences in degree of selection between functional categories of proteins
We adopted the MIPS FunCat and ProteinCat annotation schemes to assign each gene to one or more functional categories. Excess or depletion of codon sites under positive selection in a given functional category was assessed by a randomization test. In a category with a sum of N codon sites over all constituent genes, the observed number of sites under selection, n, was compared to the empirical distribution of the number of selected sites, which was derived by choosing N codon sites from random genes in any functional category and counting the number of sites under selection, n', then repeating this process 10,000 times. A two-tailed t-test was performed to estimate whether the observed number of sites under positive selection in a certain category deviated significantly from the random expectation.
FEL: fixed effects likelihood
FI: fixation index (calculated in the McDonald-Kreitman test)
IDR: intrinsically disordered region
MIPS: Munich Information Center for Protein Sequences
ORF: open reading frame
SNP: single nucleotide polymorphism.
The authors thank colleagues at Södertörn University and the Karolinska Institute for helpful discussions. The work was supported by a grant from the Baltic Sea Foundation and AW is also supported by grants from the Swedish Research Council and the Swedish Cancer Society.
- Kimura M: Evolutionary rate at the molecular level. Nature. 1968, 217: 624-626. 10.1038/217624a0.
- MacCallum C, Hill E: Being positive about selection. PLoS Biol. 2006, 4: e87. 10.1371/journal.pbio.0040087.
- Sabeti P, Reich D, Higgins J, Levine H, Richter D, Schaffner S, Gabriel S, Platko J, Patterson N, McDonald G, Ackerman H, Campbell S, Altshuler D, Cooper R, Kwiatkowski D, Ward R, Lander E: Detecting recent positive selection in the human genome from haplotype structure. Nature. 2002, 419: 832-837. 10.1038/nature01140.
- Tang K, Thornton K, Stoneking M: A new approach for using genome scans to detect recent positive selection in the human genome. PLoS Biol. 2007, 5: e171. 10.1371/journal.pbio.0050171.
- Yang Z: PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007, 24: 1586-1591. 10.1093/molbev/msm088.
- Chen S, Hung C, Xu J, Reigstad C, Magrini V, Sabo A, Blasiar D, Bieri T, Meyer R, Ozersky P, Armstrong J, Fulton R, Latreille J, Spieth J, Hooton T, Mardis E, Hultgren S, Gordon J: Identification of genes subject to positive selection in uropathogenic strains of Escherichia coli: a comparative genomics approach. Proc Natl Acad Sci USA. 2006, 103: 5977-5982. 10.1073/pnas.0600938103.
- Nielsen R, Bustamante C, Clark A, Glanowski S, Sackton T, Hubisz M, Fledel-Alon A, Tanenbaum D, Civello D, White T, Sninsky J, Adams M, Cargill M: A scan for positively selected genes in the genomes of humans and chimpanzees. PLoS Biol. 2005, 3: e170. 10.1371/journal.pbio.0030170.
- Petersen L, Bollback J, Dimmic M, Hubisz M, Nielsen R: Genes under positive selection in Escherichia coli. Genome Res. 2007, 17: 1336-1343. 10.1101/gr.6254707.
- Ramensky V, Bork P, Sunyaev S: Human non-synonymous SNPs: server and survey. Nucleic Acids Res. 2002, 30: 3894-3900. 10.1093/nar/gkf493.
- The International HapMap Consortium: The International HapMap Project. Nature. 2003, 426: 789-796. 10.1038/nature02168.
- Kirschner M, Gerhart J: Evolvability. Proc Natl Acad Sci USA. 1998, 95: 8420-8427. 10.1073/pnas.95.15.8420.
- Pigliucci M: Is evolvability evolvable?. Nat Rev Genet. 2008, 9: 75-82. 10.1038/nrg2278.
- Pigliucci M: Do we need an extended evolutionary synthesis?. Evol Int J Org Evol. 2007, 61: 2743-2749.
- Cowen L, Lindquist S: Hsp90 potentiates the rapid evolution of new traits: drug resistance in diverse fungi. Science. 2005, 309: 2185-2189. 10.1126/science.1118370.
- Parter M, Kashtan N, Alon U: Facilitated variation: how evolution learns from past environments to generalize to new environments. PLoS Comput Biol. 2008, 4: e1000206. 10.1371/journal.pcbi.1000206.
- Wagner A: Robustness, evolvability, and neutrality. FEBS Lett. 2005, 579: 1772-1778. 10.1016/j.febslet.2005.01.063.
- Wagner A: Robustness and evolvability: a paradox resolved. Proc Biol Sci. 2008, 275: 91-100. 10.1098/rspb.2007.1137.
- Dunker A, Oldfield C, Meng J, Romero P, Yang J, Chen J, Vacic V, Obradovic Z, Uversky V: The unfoldomics decade: an update on intrinsically disordered proteins. BMC Genomics. 2008, 9 (Suppl 2): S1. 10.1186/1471-2164-9-S2-S1.
- Fink A: Natively unfolded proteins. Curr Opin Struct Biol. 2005, 15: 35-41. 10.1016/j.sbi.2005.01.002.
- Lobley A, Swindells M, Orengo C, Jones D: Inferring function using patterns of native disorder in proteins. PLoS Comput Biol. 2007, 3: e162. 10.1371/journal.pcbi.0030162.
- Wright PE, Dyson HJ: Linking folding and binding. Curr Opin Struct Biol. 2009, 19: 31-38. 10.1016/j.sbi.2008.12.003.
- Tompa P, Dosztanyi Z, Simon I: Prevalent structural disorder in E. coli and S. cerevisiae proteomes. J Proteome Res. 2006, 5: 1996-2000. 10.1021/pr0600881.
- Bogatyreva NS, Finkelstein AV, Galzitskaya OV: Trend of amino acid composition of proteins of different taxa. J Bioinformatics Comput Biol. 2006, 4: 597-608. 10.1142/S0219720006002016.
- Goffeau A, Barrell B, Bussey H, Davis R, Dujon B, Feldmann H, Galibert F, Hoheisel J, Jacq C, Johnston M, Louis E, Mewes H, Murakami Y, Philippsen P, Tettelin H, Oliver S: Life with 6000 genes. Science. 1996, 274: 546. 10.1126/science.274.5287.546.
- Ghaemmaghami S, Huh W, Bower K, Howson R, Belle A, Dephoure N, O'Shea E, Weissman J: Global analysis of protein expression in yeast. Nature. 2003, 425: 737-741. 10.1038/nature02046.
- Cliften P, Sudarsanam P, Desikan A, Fulton L, Fulton B, Majors J, Waterston R, Cohen B, Johnston M: Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science. 2003, 301: 71-76. 10.1126/science.1084337.
- Dujon B, Sherman D, Fischer G, Durrens P, Casaregola S, Lafontaine I, Montigny JD, Marck C, Neuvéglise C, Talla E, Goffard N, Frangeul L, Aigle M, Anthouard V, Babour A, Barbe V, Barnay S, Blanchin S, Beckerich J, Beyne E, Bleykasten C, Boisramé A, Boyer J, Cattolico L, Confanioleri F, Daruvar AD, Despons L, Fabre E, Fairhead C, Ferry-Dumazet H, et al: Genome evolution in yeasts. Nature. 2004, 430: 35-44. 10.1038/nature02579.
- Kellis M, Patterson N, Endrizzi M, Birren B, Lander E: Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature. 2003, 423: 241-254. 10.1038/nature01644.
- Rossignol T, Lechat P, Cuomo C, Zeng Q, Moszer I, d'Enfert C: CandidaDB: a multi-genome database for Candida species and related Saccharomycotina. Nucleic Acids Res. 2008, 36 (Database issue): D557-561.
- Liti G, Carter D, Moses A, Warringer J, Parts L, James S, Davey R, Roberts I, Burt A, Koufopanou V, Tsai I, Bergman C, Bensasson D, O'Kelly M, Oudenaarden Av, Barton D, Bailes E, Nguyen A, Jones M, Quail M, Goodhead I, Sims S, Smith F, Blomberg A, Durbin R, Louis E: Population genomics of domestic and wild yeasts. Nature. 2009, 458: 337-341. 10.1038/nature07743.
- Schaefer C, Schlessinger A, Rost B: Protein secondary structure appears to be robust under in silico evolution while protein disorder appears not to be. Bioinformatics. 2010, 26: 625-631. 10.1093/bioinformatics/btq012.
- Tompa P, Kalmar L: Power law distribution defines structural disorder as a structural element directly linked with function. J Mol Biol. 2010, 403: 346-350. 10.1016/j.jmb.2010.07.044.
- Chen J, Romero P, Uversky V, Dunker A: Conservation of intrinsic disorder in protein domains and families: I. A database of conserved predicted disordered regions. J Proteome Res. 2006, 5: 879-887. 10.1021/pr060048x.
- Ruepp A, Zollner A, Maier D, Albermann K, Hani J, Mokrejs M, Güldener ITU, Mannhaupt G, Münsterkötter M, Mewes H: The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res. 2004, 32: 5539-5545. 10.1093/nar/gkh894.
- Mayrose I, Doron-Faigenboim A, Bacharach E, Pupko T: Towards realistic codon models: among site variability and dependency of synonymous and non-synonymous rates. Bioinformatics. 2007, 23: i319-327. 10.1093/bioinformatics/btm176.
- Nozawa M, Suzuki Y, Nei M: Reliabilities of identifying positive selection by the branch-site and the site-prediction methods. Proc Natl Acad Sci USA. 2009, 106: 6700-6705. 10.1073/pnas.0901855106.
- Luo C, Lu X, Stubbs L, Kim J: Rapid evolution of a recently retroposed transcription factor YY2 in mammalian genomes. Genomics. 2006, 87: 348-355. 10.1016/j.ygeno.2005.11.001.
- Maiti S, Doskow J, Sutton K, Nhim R, Lawlor D, Levan K, Lindsey J, Wilkinson M: The Pem homeobox gene: rapid evolution of the homeodomain, X chromosomal localization, and expression in reproductive tissue. Genomics. 1996, 34: 304-316. 10.1006/geno.1996.0291.
- Zhang J, Webb D, Podlaha O: Accelerated protein evolution and origins of human-specific features: Foxp2 as an example. Genetics. 2002, 162: 1825-1835.
- Beskow A, Wright AP: Comparative analysis of regulatory transcription factors in Schizosaccharomyces pombe and budding yeasts. Yeast. 2006, 23: 929-935. 10.1002/yea.1413.
- Arnold F, Wintrode P, Miyazaki K, Gershenson A: How enzymes adapt: lessons from directed evolution. Trends Biochem Sci. 2001, 26: 100-106. 10.1016/S0968-0004(00)01755-2.
- Bloom J, Wilke C, Arnold F, Adami C: Stability and the evolvability of function in a model protein. Biophys J. 2004, 86: 2758-2764. 10.1016/S0006-3495(04)74329-5.
- Basu M, Carmel L, Rogozin I, Koonin E: Evolution of protein domain promiscuity in eukaryotes. Genome Res. 2008, 18: 449-461. 10.1101/gr.6943508.
- Shimizu K, Toh H: Interaction between intrinsically disordered proteins frequently occurs in a human protein-protein interaction network. J Mol Biol. 2009, 392: 1253-1265. 10.1016/j.jmb.2009.07.088.
- King M, Wilson A: Evolution at two levels in humans and chimpanzees. Science. 1975, 188: 107-116. 10.1126/science.1090005.
- Hsia C, McGinnis W: Evolution of transcription factor function. Curr Opin Genet Dev. 2003, 13: 199-206. 10.1016/S0959-437X(03)00017-0.
- Tirosh I, Barkai N, Verstrepen KJ: Promoter architecture and the evolvability of gene expression. J Biol. 2009, 8: 95. 10.1186/jbiol204.
- Choi JK, Kim YJ: Epigenetic regulation and the variability of gene expression. Nat Genet. 2008, 40: 141-147. 10.1038/ng.2007.58.
- Tirosh I, Reikhav S, Levy AA, Barkai N: A yeast hybrid provides insight into the evolution of gene expression regulation. Science. 2009, 324: 659-662. 10.1126/science.1169766.
- Wittkopp PJ, Haerum BK, Clark AG: Evolutionary changes in cis and trans gene regulation. Nature. 2004, 430: 85-88. 10.1038/nature02698.
- Lynch V, Wagner G: Resurrecting the role of transcription factor change in developmental evolution. Evolution. 2008, 62: 2131-2154. 10.1111/j.1558-5646.2008.00440.x.
- Liu J, Narayanan B, Oldfield C, Su E, Uversky V, Dunker A: Intrinsic disorder in transcription factors. Biochemistry. 2006, 45: 6873-6888. 10.1021/bi0602718.
- McEwan IJ, Dahlman-Wright K, Ford J, Wright AP: Functional interaction of the c-Myc transactivation domain with the TATA binding protein: evidence for an induced fit model of transactivation domain folding. Biochemistry. 1996, 35: 9584-9593. 10.1021/bi960793v.
- Radhakrishnan I, Perez-Alvarado GC, Parker D, Dyson HJ, Montminy MR, Wright PE: Solution structure of the KIX domain of CBP bound to the transactivation domain of CREB: a model for activator:coactivator interactions. Cell. 1997, 91: 741-752. 10.1016/S0092-8674(00)80463-8.
- Ferreira ME, Hermann S, Prochasson P, Workman JL, Berndt KD, Wright AP: Mechanism of transcription factor recruitment by acidic activators. J Biol Chem. 2005, 280: 21779-21784. 10.1074/jbc.M502627200.
- Hermann S, Berndt KD, Wright AP: How transcriptional activators bind target proteins. J Biol Chem. 2001, 276: 40127-40132. 10.1074/jbc.M103793200.
- Sanger FTP. [ftp://ftp.sanger.ac.uk/pub/dmc/yeast]
- Saccharomyces Genome Database. [http://www.yeastgenome.org]
- Rice P, Longden I, Bleasby A: EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 2000, 16: 276-277. 10.1016/S0168-9525(00)02024-2.
- Subramanian A, Weyer-Menkhoff J, Kaufmann M, Morgenstern B: DIALIGN-T: an improved algorithm for segment-based multiple sequence alignment. BMC Bioinformatics. 2005, 6: 66. 10.1186/1471-2105-6-66.
- Anisimova M, Nielsen R, Yang Z: Effect of recombination on the accuracy of the likelihood method for detecting positive selection at amino acid sites. Genetics. 2003, 164: 1229-1236.
- Sawyer S: Statistical tests for detecting gene conversion. Mol Biol Evol. 1989, 6: 526-538.
- Jones D: Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol. 1999, 292: 195-202. 10.1006/jmbi.1999.3091.
- Peng K, Radivojac P, Vucetic S, Dunker A, Obradovic Z: Length-dependent prediction of protein intrinsic disorder. BMC Bioinformatics. 2006, 7: 208. 10.1186/1471-2105-7-208.
- Bordoli L, Kiefer F, Schwede T: Assessment of disorder predictions in CASP7. Proteins. 2007, 69 (Suppl 8): 129-136.
- Jones D, Taylor W, Thornton J: A model recognition approach to the prediction of all-helical membrane protein structure and topology. Biochemistry. 1994, 33: 3038-3049. 10.1021/bi00176a037.
- PFAM database. [ftp://ftp.sanger.ac.uk//pub/databases/Pfam//releases/Pfam25.0/pdbmap.gz]
- Pond S, Frost S, Muse S: HyPhy: hypothesis testing using phylogenies. Bioinformatics. 2005, 21: 676-679. 10.1093/bioinformatics/bti079.
- Tajima F, Nei M: Estimation of evolutionary distance between nucleotide sequences. Mol Biol Evol. 1984, 1: 269-285.
- Hasegawa M, Kishino H, Yano T: Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol. 1985, 22: 160-174. 10.1007/BF02101694.
- McDonald J, Kreitman M: Adaptive protein evolution at the Adh locus in Drosophila. Nature. 1991, 351: 652-654. 10.1038/351652a0.
- Thornton K: Libsequence: a C++ class library for evolutionary genetic analysis. Bioinformatics. 2003, 19: 2325-2327. 10.1093/bioinformatics/btg316.
- Kimura M: The Neutral Theory of Molecular Evolution. 1983, Cambridge: Cambridge University Press
- Fay J, Wyckoff G, Wu C-I: Positive and negative selection on the human genome. Genetics. 2001, 158: 1227-1234.
- Charlesworth J, Eyre-Walker A: The McDonald-Kreitman test and slightly deleterious mutations. Mol Biol Evol. 2008, 25: 1007-1015. 10.1093/molbev/msn005.
- Chakrabarti S, Lanczycki C: Analysis and prediction of functionally important sites in proteins. Protein Sci. 2007, 16: 4-13. 10.1110/ps.062506407.
- Lanczycki C, Chakrabarti S: A tool for the prediction of functionally important sites in proteins using a library of functional templates. Bioinformation. 2008, 2: 279-283.
- Finn R, Tate J, Mistry J, Coggill P, Sammut S, Hotz H, Ceric G, Forslund K, Eddy S, Sonnhammer E, Bateman A: The Pfam protein families database. Nucleic Acids Res. 2008, 36: D281-D288. 10.1093/nar/gkn226.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.