Progress towards mapping the universe of protein folds
© BioMed Central Ltd 2004
Published: 29 April 2004
Although the precise aims differ between the various international structural genomics initiatives currently aiming to illuminate the universe of protein folds, many selectively target protein families for which the fold is unknown. How well can the current set of known protein families and folds be used to estimate the total number of folds in nature, and will structural genomics initiatives yield representatives for all the major protein families within a reasonable time scale?
In order to attempt predictions of the universe of protein folds - so-called fold space - we need to know how many protein families there are in nature and how many of these are likely to possess a novel fold. Genome sequencing still considerably outpaces the various structural genomics initiatives currently underway in the USA, Canada, Japan, Germany and the UK, with more than 160 completely sequenced genomes yielding about one million protein sequences at the start of 2004 . This contrasts with 24,000 entries of three-dimensional protein structures in the Protein Data Bank (PDB) [2, 3], some 500 of which were determined by structural genomics consortia over the last three years. Although this seems a daunting contrast, mounting evidence from the Gene3D (our unpublished data and ), SUPERFAMILY [5, 6], and Genomic Threading [7, 8] databases suggests that a relatively small repertoire of protein folds (around 800) can already be mapped onto about half of all the amino-acid residues encoded in the currently available genome sequences.
Encouragingly, and in parallel with the expansions in the structure and sequence databanks over the last decade, powerful new technologies have been developed for recognizing relationships between proteins on the basis of sequence and/or structural similarity. These allow the universe of protein-family space to be more accurately charted, by allowing recognition of extremely distant homologs.
Although Wolf et al.  attempted to predict the number of folds in individual genomes, most estimates consider the total number of folds in all of nature. Current estimates of the number of folds range from 1,000 to 10,000, depending on the models and approximations applied [11–13]. One of the earliest estimates of fold numbers was a simple approximation by Chothia . This assumed that there is a limited number of folds in nature that sequences can adopt, given the intrinsic physical constraints. If these are randomly sampled in the projects that solve protein structures, then the probability that a new protein sequence has a known fold can be estimated by determining the proportion of unrelated sequences, for example in the structure classifications database SCOP [15, 16], that share the same fold as one another and are therefore likely to share that fold with the new sequence. This approach predicted around 1,000 folds, given the proportion of sequences of known structure in SCOP that had unique folds, the fraction of the Swiss-Prot sequence database [17, 18] these sequences comprised, and the fraction of new sequences found to be related to sequences already in Swiss-Prot.
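The extrapolation behind this kind of estimate can be sketched in a few lines. The function below is an illustration only, not Chothia's published calculation: the function name, the parameter names and the input values are all hypothetical, chosen so that the arithmetic is easy to follow. It assumes, as the approach does, that known folds are a random sample of fold space, so the number of known folds is divided by the fraction of sequences that known structures are presumed to cover.

```python
def extrapolate_fold_count(known_folds, frac_db_with_known_fold, frac_new_in_db):
    """Scale a known fold count by the presumed coverage of fold space.

    known_folds              -- number of distinct folds seen so far (hypothetical input)
    frac_db_with_known_fold  -- fraction of database sequences related to a known structure
    frac_new_in_db           -- fraction of newly determined sequences related to the database
    """
    for fraction in (frac_db_with_known_fold, frac_new_in_db):
        if not 0 < fraction <= 1:
            raise ValueError("fractions must lie in (0, 1]")
    # Presumed share of all sequences whose fold is already known.
    coverage = frac_db_with_known_fold * frac_new_in_db
    return known_folds / coverage


# Illustrative values only; they are not taken from SCOP or Swiss-Prot.
estimate = extrapolate_fold_count(100, 0.25, 0.4)
print(round(estimate))
```

The estimate is only as good as the random-sampling assumption: if structure determination favours some folds, the coverage term is overstated and the fold count understated.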
Although similarity in the folds adopted by different families may reflect folding preferences and convergence to energetically stable folds, it is likely that many of the families that adopt the superfolds are in fact very distantly related, beyond the sensitivity of current algorithms to detect homology at the sequence level. Families adopting the eight-stranded α/β TIM-barrel folds are a case in point, with recent analysis suggesting that many of these families may have evolutionary links - an idea that is supported by unusual sequence signatures and functional properties [25, 26].
Since Chothia's early estimates, several groups have applied more sophisticated statistical approaches that model the uneven distribution of fold usage in various ways [22, 24]. Random sampling of known sequence families and assigning equal likelihood to each fold gives rise to a non-uniform fold distribution which, when further modified to account for the extreme bias of the superfolds and the fact that many folds are only rarely seen in nature, gives an estimate of 4,000 folds.
Coulson and Moult  assume the existence of three types of folds: superfolds, which are adopted by very many protein families and are highly recurrent within proteomes; mesofolds, which have an intermediate number of protein families associated with them; and unifolds, adopted by a single narrow sequence family. On the basis of this assumption, they simulated the expansion of new folds classified in the SCOP structure database over the preceding two years, as a fraction of new sequence families added. Assuming a maximum of 50,000 protein families in nature, this approach predicts up to 400 mesofolds and some 10,000 unifolds in addition to 9 superfolds. Perhaps more importantly, the majority of sequence families belong to superfold and mesofold groups, and for 80% of these families we probably already know the fold.
Several groups have attempted to model the uneven fold-family distribution using power laws. Power law distributions - in which a small number of high-frequency instances occur, but there is a moderate number of common instances and a huge number of very rare instances - appear to be ubiquitous in nature and society, and seem to explain many of the biological trends recently revealed by genome data, such as protein-family distributions, domain associations, and protein-protein interactions [13, 27, 28]. Karev et al. [29, 30] model protein-family distributions by simulating the birth (gene duplication), death (gene loss) and innovation (new protein) of different domains in individual genomes. Although this entirely stochastic model fails to account completely for the observed distribution, it shows that a close fit is possible using a model with only three independent parameters. Implicit in the model is the notion that the 'fit' get 'fitter', and domains randomly duplicated early in evolution increasingly dominate the population. None of these models incorporates selection pressures that might operate to favor the retention of duplicated domains performing important biochemical activities. But, in fact, many highly recurrent domains do appear to have important biochemical functions, for example in providing energy or redox equivalents for enzyme reactions, or in responding to cellular signals and binding to DNA [31, 32].
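A toy simulation can make the birth, death and innovation idea concrete. The sketch below is not the model of Karev et al.; it is a minimal stochastic process written for illustration, and the function name, the probabilities and the seed are assumptions. At each step a new single-member family may appear (innovation); otherwise a domain is picked at random, so that large families are chosen more often, and is either duplicated (birth) or lost (death).

```python
import random
from collections import Counter


def simulate_families(steps, p_birth, p_death, p_innovation, seed=0):
    """Return a list of family sizes after a toy birth/death/innovation process."""
    if p_birth + p_death + p_innovation > 1:
        raise ValueError("event probabilities must sum to at most 1")
    rng = random.Random(seed)
    sizes = [1]  # start from one family with a single domain
    for _ in range(steps):
        r = rng.random()
        if r < p_innovation or not sizes:
            sizes.append(1)  # innovation: a new family appears
            continue
        # Pick a family with probability proportional to its size,
        # which is equivalent to picking one domain uniformly at random.
        idx = rng.choices(range(len(sizes)), weights=sizes, k=1)[0]
        if r < p_innovation + p_birth:
            sizes[idx] += 1  # birth: the domain is duplicated
        elif r < p_innovation + p_birth + p_death:
            sizes[idx] -= 1  # death: the domain is lost
            if sizes[idx] == 0:
                sizes.pop(idx)  # the family has died out
    return sizes


sizes = simulate_families(steps=5000, p_birth=0.5, p_death=0.3, p_innovation=0.2)
histogram = Counter(sizes)  # family size -> number of families of that size
for size in sorted(histogram)[:5]:
    print(size, histogram[size])
```

Because duplication is proportional to family size, families that grow early tend to keep growing, which is the behaviour the paragraph above describes. No selection pressure is modelled here either.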
These more recent models of the number of folds [12, 22–24, 29, 30] continue to ignore possible biases in the structure and sequence databases. For example, it is likely that proteins sampled for structure determination have been relatively easy to solubilize, purify and crystallize - as shown by the small numbers of transmembrane structures known. Perhaps more worrying are recent analyses suggesting that we have barely sampled sequence and family space, as each new genome adds more families and there is no sign of saturation in this expansion . Even with the huge advances in genome sequencing, there are still at least ten million organisms as yet uncharacterized .
To be more optimistic, though, families are likely to coalesce considerably as the sequence and structure databases expand: larger databases make it easier to link relatives and also increase the sensitivity of profile-based homology search and fold-recognition methods. Assessment of several widely used homolog-detection methods (such as PSI-BLAST and hidden Markov models, HMMs) using structurally validated homologs has shown significant increases in performance accompanying expansions in the sequence and structure databases.
Given that most estimates of how many folds there are depend heavily on the numbers of protein families that have been identified and their mapping to existing folds, it is useful to briefly consider the current strategies and technical challenges involved in identifying these families. Structural genomics initiatives have promoted several new sequence-based approaches to recognizing protein families. These arose because although there are many well-established protein-family databases (such as PRINTS [34, 35], Pfam [36, 37], SMART [38, 39], ProDom [40, 41], InterPro [42, 43], TIGRFAMs [44, 45] and MIPS [46, 47]) most cover only a relatively small proportion of the known sequences. Pfam [36, 37], which now includes over 7,000 manually curated families, identifies many of the largest protein families, and any lack of coverage is addressed to a certain extent by InterPro [42, 43], which integrates Pfam with several other protein-family resources. The advantage of all these curated databases is that relatives are recognized using family-specific sequence profiles or regular expressions, and there is some degree of manual validation.
Faster approaches for identifying protein families within very large datasets (such as those in non-redundant GenBank [48, 49] or Swiss-Prot/TrEMBL [17, 18]) often involve aligning the sequences against each other using BLAST and then clustering those with significant similarity [50–54]. The simplest protocols use single-linkage clustering, which often merges too many families, producing clusters whose members share insufficient global similarity. In ProtoNet [50, 51] these effects are handled by permitting alternative user-defined clustering thresholds, so that granularity can range from small families of closely related proteins to much broader families comprising proteins that share common sequence motifs. Some of the most promising new methods employ Markov clustering, in particular the TribeMCL implementation developed by Enright and co-workers and used by the TRIBES [56, 57] and Gene3D resources (our unpublished data and ).
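The weakness of single-linkage clustering is easy to see in a small sketch. The code below is a generic union-find implementation, not the protocol of any of the resources named above; the function name, the sequence labels and the similarity scores are hypothetical, and a simple "higher is more similar" score stands in for whatever significance measure a real pipeline would take from BLAST.

```python
def single_linkage(items, hits, threshold):
    """Cluster items: any pair scoring at or above threshold joins two clusters."""
    parent = {item: item for item in items}

    def find(item):
        # Follow parent links to the cluster representative, shortening the path.
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for a, b, score in hits:
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for item in items:
        clusters.setdefault(find(item), []).append(item)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical pairwise scores: A~B and B~C are similar, but A and C are not.
hits = [("A", "B", 0.9), ("B", "C", 0.8), ("A", "C", 0.1), ("D", "E", 0.2)]
print(single_linkage(["A", "B", "C", "D", "E"], hits, threshold=0.5))
# prints [['A', 'B', 'C'], ['D'], ['E']]
```

A and C end up in the same cluster purely because each resembles B, which is how a shared domain in a multi-domain protein can chain otherwise unrelated families together. Raising the threshold reduces such chaining at the cost of splitting genuine families.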
One of the hardest problems in clustering sequences into protein families is handling the similarities between multi-domain proteins and the fact that many different multi-domain proteins share common domains but in different contexts. A significant proportion of proteins are multi-domain - up to 80% in eukaryotes. Furthermore, Teichmann and others have shown that domains have frequently been shuffled and recombined in different ways within genomes, often giving rise to subtly different functions.
This recurrence of domains suggests their importance as primary evolutionary units, and although some researchers hypothesize that smaller supersecondary structural motifs may be the building blocks of evolution , the majority of globular compact folds characterized to date comprise whole domains. Thus, although some protein-family resources cluster complete gene sequences into families, most attempt to divide proteins into their constituent domains before or after clustering. Recognizing the boundaries of domains is a non-trivial algorithmic challenge, however, particularly if no structural data are available. Even methods based on structures disagree in their assignments 20-40% of the time . The problem is compounded by discontinuities in some domain sequences, whereby the insertion of a second domain disrupts an existing domain within a multi-domain protein. Structural data in the CATH database [20, 21] suggest that these discontinuities exist in about 23% of domains occurring in multi-domain proteins .
Some of the most successful approaches to the problem of domain-boundary prediction combine sequence data with the propensities of particular amino-acid residues, using neural networks [54, 63, 64]. Other methods exploit the recurrence of domains in different contexts to identify boundaries from multiple alignments [40, 65, 66]. The elegant approach of Heger and Holm (named ADDA) exploits graph theory to build networks of domain links in multi-domain proteins from which multiple alignments can be extracted and recursively analyzed and chopped up to yield their single-domain components.
Estimates of the number of protein families that have so far been identified vary substantially, depending on the sequence datasets clustered and the thresholds employed. The ADDA algorithm of Heger and Holm was applied to a combined sequence dataset - derived from Swiss-Prot, TrEMBL, the Protein Information Resource (PIR), PDB, the Caenorhabditis elegans protein database Wormpep and Ensembl genome databases - which, after removing redundancy at 40% sequence identity, contained almost 250,000 protein sequences. These sequences are chopped into domains, which are then clustered into some 34,000 domain families; almost 170,000 domains remain as singletons that are not clustered into any family. Similarly, a recent analysis by Liu and Rost, chopping and clustering sequences from eukaryotic genomes, suggested 17,000 domain-like clusters (regions likely to be domains) in eukaryotes that are likely to have a currently unidentifiable globular structure. Again, these represent low estimates, as the eukaryotic genomes currently contribute about half of the total sequences within completed genomes. A more recent publication reports 63,000 domain families from the clustering of 62 complete genomes [68, 69].
Many of the largest families in Gene3D are very sequence-diverse and are perhaps better described as superfamilies, containing some very distant homologs (proteins with less than 20% sequence identity). Thus, although Gene3D identifies almost 53,000 domain superfamilies, these comprise 205,000 close families, in which relatives have 35% or more sequence identity; and at least 20% of these close families have one or more members with at least 35% sequence identity to a known structure. This suggests that structural genomics initiatives would need to target representatives for the remaining 165,000 or so families to obtain good structural models for all families in the examined genomes.
Table 1. A summary of the families and superfamilies within Gene3D
Columns: Type of family | Proportion of non-singleton domains | Number of superfamilies | Number of close families | Number of folds
Rows: Known structure (CATH) | Superfolds (all of known structure) | Unknown structure (Pfam) | Total, excluding singletons
Number of folds, known structure (CATH): 759 + 54
Several analyses (for example [74, 75]) have shown that approximately 22% of predicted protein sequences from genome sequences (which will overlap to some extent with CATH and Pfam assignments) contain transmembrane regions, and about 10-20% of predicted sequences contain long regions (50-100 amino acids) of disorder or low complexity. There is also a significant proportion (around 16%) of small amino-acid sequences with no predicted secondary structure .
Are the singletons - of which there are currently 60,000 in Gene3D - in fact distant relatives of existing families that are not recognized by current algorithms, or are they genuinely unique sequences having novel folds? Kunin and co-workers recently showed that although some singletons are reassigned to families as new genomes are completed, there is still an overall gain in the number of singletons with each additional sequenced genome. This may change as the databases expand and recognition methods improve. Original estimates of the proportion of singletons in bacterial genomes lay at about 50%, but this number has steadily fallen, with average values of 30% for the first release of Gene3D in 2002, and 18% for more recent releases of Gene3D. Some proportion of these proteins may nevertheless represent genuinely new families and folds.
Singletons tend to be shorter than the average structural domain, and many of the very small sequences containing disordered regions may correspond to unstructured proteins existing only as complexes and/or peptides involved in regulation and binding to DNA. These proteins may therefore not fold independently and will lie outside the range of targets amenable to structural genomics.
Using the number of domain families identified by Gene3D (see Figures 1 and 4), we can make a very simple approximation of the total number of folds in nature by making the following four assumptions. First, we assume that we now know the folds for all the superfolds - defined as folds with three or more homologous superfamilies in CATH (at present this accounts for 71,080 close domain families for 54 highly populated CATH folds; see Table 1). Second, we assume that we have been able to map these folds onto all their relatives in the genome sequences, and so we can remove these folds and families while estimating the remaining numbers of folds. Third, we assume that singletons can be removed from the estimate, as they are probably very distant relatives belonging to known folds that have diverged beyond the sensitivity of current recognition methods, or else they are short sequences unlikely to fold independently but associated with functional complexes. Although singletons could represent novel folds and could therefore skew any estimate of the total number of protein folds, they do not represent a significant proportion of domains. Finally, we assume that non-superfolds and non-singletons have been sampled randomly by families in nature and that there are no biases in their representation within the current sequence and structure databases.
Removing the 54 superfolds from the Gene3D dataset leaves 22,491 close domain families of known structure (see Table 1), which adopt 759 folds in CATH (see Figure 1). We can therefore expect the remaining 114,695 domain families in Gene3D that are of unknown structure (Pfam close domain families plus NewFam close domain families) to adopt (114,695/22,491) × 759, or 3,871 new folds. Adding together the superfolds, known folds and estimated number of new folds (54 + 759 + 3,871) we get an estimate of the number of folds encoded within the 120 genomes included in Gene3D of 4,684. This will probably be a lower bound for the total number of protein folds in nature. But all fold estimates are unsatisfying, in that they necessitate simplified models of fold usage and optimism regarding lack of bias in the databases; whilst our sampling of 'species' space remains so sparse, calculations on the numbers of folds in all of nature seem rather esoteric.
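The arithmetic of this estimate can be reproduced directly from the figures quoted in the text. The short sketch below does only that; the function and variable names are ours, and the proportional scaling rests on the random-sampling assumption stated earlier.

```python
def estimate_total_folds(superfolds, known_folds, known_families, unknown_families):
    """Scale known non-superfold folds by the ratio of unknown to known families."""
    new_folds = round(unknown_families / known_families * known_folds)
    total = superfolds + known_folds + new_folds
    return new_folds, total


new_folds, total = estimate_total_folds(
    superfolds=54,             # highly populated CATH folds
    known_folds=759,           # remaining CATH folds
    known_families=22_491,     # close families of known structure, superfolds removed
    unknown_families=114_695,  # Pfam plus NewFam close families of unknown structure
)
print(new_folds, total)  # prints 3871 4684
```

The result is directly proportional to the ratio of unknown to known families, so any coalescence of families of unknown structure into known ones would lower the estimate.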
Perhaps a more optimistic outlook for the structural genomics initiatives comes from the observation that fewer than 1,000 large CATH and Pfam families map to a significant proportion (around 60%) of all the predicted products of genome sequences, excluding singletons (see Figure 4). What roles are relatives from these large families performing and why are they recurring so frequently within the proteomes?
We used Gene3D to examine the recurrence of structurally characterized families in the predicted proteomes of a set of 56 bacterial genomes. Interestingly, some 274 CATH-defined families are common to a significant proportion of these genomes. Fewer than 30 of the families are highly duplicated, dominating almost 50% of all the CATH-annotated genome sequences. In these families, domain recurrence in any proteome correlates with genome size and, in some families, domains are frequently located in proteins with different domain compositions. Many are associated with metabolic pathways, where they perform generic functions such as the provision of energy or redox equivalents for reactions. Frequently some aspect of the chemistry is conserved between paralogs, but substrate specificity may have been modulated by changes in the geometry of active sites. In some cases structural embellishments to the fold cause changes in surface geometry, modulating protein-protein interactions and altering the repertoire of domain associations. A significant proportion adopt a small number of folds, namely TIM-barrel folds, Rossmann-like folds or αβ-plait superfolds. Interestingly, these are among the most ancient folds [78, 79]. They all possess simple, regular, layered architectures that might be expected to promote optimal packing of hydrophobic residues in the core of the protein. In support of these hypotheses, Caetano-Anolles and Caetano-Anolles have also proposed that αβ sandwiches and β-barrel-like structures evolved first, with β sandwiches evolving later, predominantly in eukaryotes, where the all-β immunoglobulin superfold recurs extensively. The regularity of their architectures may contribute to the ease with which these folds have been observed to tolerate residue mutations, allowing some of the families to diverge further and to adopt a range of different functions.
Functional utility may also contribute to the wide recurrence of these domains. As Koonin and co-workers propose, some perform generic functions that are well conserved (for example, nucleotide binding in the Rossmann-like domains) and have been re-used in multiple functional contexts (in different pathways or cellular locations). Alternatively, as in the case of TIM barrels and αβ-plait folds, these architectures possess functional sites (for example the base of the β barrel in the TIMs or the exposed β-sheet surface in the αβ-plaits) that can easily be re-engineered to bring diverse combinations of residues into contact, thereby creating novel catalytic environments.
Our estimates here, made using Gene3D, suggest that the largest, most recurrent families encoded within the sequenced genomes have already been characterized in the CATH database and can be expected to adopt about 800 folds. How realistic are our simple estimates of approximately 3,900 folds to be adopted by the remaining families, most of which are characterized in Pfam and some of which are quite small? (For example, Figure 2 shows that the remaining uncharacterized NewFam families are generally much smaller than the CATH and Pfam families.) Small families may turn out to be very distant relatives of superfolds that have diverged beyond recognition, and in acquiring highly specialized functions these now have the narrow sequence constraints observed today. Some may be completely new folds, however, that have arisen by more recent shuffling of subdomains and motifs. Soding and Lupas have presented some intriguing models of evolutionary pathways using diverse recombination of small common submotifs such as α hairpins and αβ motifs. There are fascinating examples of relatives in some families that appear to have acquired new folds through subtle rearrangements within supersecondary motifs [60, 81].
It is clear that some common structural motifs are highly reused , and this has meant that fold space should perhaps more accurately be viewed as a continuum [83, 84], where significant structural overlaps occur in some regions. For the most highly populated architectures within CATH (αβ sandwiches and β sandwiches), folds are often highly 'gregarious' (that is, some subcomponents of the fold are shared with other folds), with at least 40-50% of their structures overlapping structures from other fold groups. Given that the relatives in many large superfamilies adopting these architectures (for example, superfamilies adopting Rossmann-like folds or αβ-plait folds) can be highly structurally divergent, with only 50% of residues in the core remaining structurally conserved during evolution , these overlaps can create problems in identifying distinct regions within fold space. The continuous nature of fold space may mean that simulations exploring the number of folds in nature are unrealistic, and that it may be more useful to try to understand the mechanisms by which common motifs can be assembled.
In this context, it is notable that there have recently been some considerable successes in ab initio structure prediction, using approaches that assemble proteins from peptide fragment libraries derived from known structures . There now appear to be structural representatives for most 10-15 residue peptides , particularly those occurring within secondary structures, and so these advances may become increasingly important for structural modeling of the large number of singletons and 'unifolds' revealed by genome analyses. Such coarse models could help in suggesting the location of an active site or functional interface, perhaps allowing the putative biochemical role of the protein to be modeled in a systems biology context, even if they are not of sufficiently high accuracy to allow drug design.
In summary, attempts to predict the total number of folds in nature are still hampered by uncertainties and approximations. Most calculations predict somewhere in the range of 1,000-10,000 folds. Encouragingly for our understanding of evolution and biological systems, we now know the fold for many of the largest families, in particular those that dominate the genome annotations. Some 800 CATH folds and an additional 1,830 structurally uncharacterized Pfam families can already be assigned to approximately 70% of proteins predicted from genome sequences. Structural genomics initiatives that target the large structurally uncharacterized families can be expected to succeed in mapping fold space for a significant proportion of sequence space over the coming years.