Tools and resources for identifying protein families, domains and motifs
© BioMed Central Ltd 2001
Published: 19 December 2001
With the large influx of raw sequence data from genome sequencing projects, there is a need for reliable automatic methods for protein sequence analysis and classification. The most useful tools use various methods for identifying motifs or domains found in previously characterized protein families. This article reviews the tools and resources available on the web for identifying signatures within proteins and discusses how they may be used in the analysis of new or unknown protein sequences.
What is the problem to be solved?
In June 2000 the first draft of the human sequence was announced, and was considered to be an achievement equal to that of putting the first man on the moon. The announcement brought promises of breakthroughs in treating human diseases, but in fact all it meant was a flood of data to be converted into useful biological information. To live up to the promise of the sequence, the first obstacles are to classify the genes it contains and to assign functions to the gene products. Protein sequences can be classified by identifying the protein type, but they then need to be characterized further, to assign biological function. The challenge is in this application of useful biological knowledge to particular protein sequences.
There are several reasons to choose to characterize proteins rather than DNA sequences. These include: the larger alphabet (20 amino acids versus 4 bases); the higher signal-to-noise ratio in protein sequence searches; the closeness between protein sequence and function; and the availability of good, well annotated databases of protein sequences and protein sequence signatures. Proteins can be characterized at different levels: they perform a function in a cell, but this function is also performed within a particular context, for example as part of a complex pathway, as well as at a defined cellular location. At the functional level, this may come down to analysis of the protein sequence along its entire length, at the level of single domains or motifs, or at the finest level, single important amino-acid residues. With the increased availability of completely sequenced genomes, and using the correct tools and resources, there is scope for protein characterization on all these levels.
The first step in the analysis of new or uncharacterized protein sequences is traditionally to search the protein databases for similar sequences. The main protein sequence databases available are SWISS-PROT and TrEMBL [2,3], the Protein Information Resource (PIR) [4,5], and GenPept, which is a translation of GenBank [6,7]. If the similarity to proteins in a database is significant, information from the proteins in the search results can be inferred to apply also to the query sequence. This relies on the quality of the annotation in the protein sequence databases, and, more generally, on the availability of experimental results in the scientific literature. But problems arise during sequence-similarity searches when more than one domain is present in a protein. A large number of matches to one domain in a sequence may mask 'hits' that match a second domain in the sequence, and thus useful information is lost. It is also possible for proteins to be evolutionarily related but for their sequences to have diverged to such an extent that they are not picked up in a sequence-similarity search. And, as the protein sequence databases grow, the number of related sequences rises, so a search identifies a large set of highly related sequences and the less related sequence hits may be lost. It is for these reasons that protein signature databases evolved and have become increasingly useful tools for protein sequence analysis; they aim to identify domains, or classify proteins into families, and thereby infer function. A signature refers to the diagnostic entity used to recognize a domain or family; it may be derived using a number of methods, including patterns and profiles (discussed below). This article presents the main signature databases available for protein sequence analysis, their methods, and their individual uses.
How is it done?
The basic information about a protein comes from its sequence. From a single sequence it is difficult to infer anything about the protein, but as the number of related sequences increases, so an alignment can be built to create a consensus for a protein family, or to identify conserved domains or highly conserved residues that may be important for function, for example in an active site. These conserved areas of a protein family, domain or functional site can be used to define identifiable features using several different methods. These include building up regular expressions to show patterns of conserved amino-acid residues ('pattern' is used here to mean a precise, contiguous stretch of sequence); producing detailed profiles from sequence alignments; and hidden Markov models (HMMs), which are profiles derived using a more complex probabilistic scoring mechanism. A profile is built from a sequence alignment, and describes the probability of finding an amino acid at a given position in the sequence; the profile constitutes a table, or matrix, of position-specific amino-acid weights and gap costs. The numbers in the table (scores) are used to calculate similarity scores between a profile and a sequence within a given alignment. A threshold score is calculated for each set of sequences, so that only sequences scoring above this threshold are considered to be related to the original set of sequences in the alignment. Each method has its own advantages. For example, patterns are relatively simple to build and are very useful for small regions of conserved amino acids, such as active sites or binding sites - but they fail to provide information about the rest of the sequence - and because of the constraints on which amino acids may be found within a given area of the sequence, patterns fail to pick up related sequences that have even a small divergence in that particular area.
Profiles and HMMs compensate for these problems in that they generally cover larger areas of the sequence, and because all amino acids have a chance of occurring at a given position, albeit with a lower probability or score, more divergent family members may still be included in the hit list (the term 'hit list' in this article refers to the list of proteins that match or contain a particular signature above the required score).
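The difference between a rigid pattern and a profile can be illustrated with a minimal Python sketch. It assumes an ungapped toy alignment, a uniform background frequency and add-one pseudocounts, and it omits gap costs entirely; the alignment, the function names and the threshold value are invented for illustration and are not taken from any of the databases described here.

```python
import math
import re
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} table per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(column[aa] for column, aa in zip(profile, segment))


# Toy alignment of four related five-residue segments (hypothetical data).
alignment = ["GDSGG", "GDSGG", "GDSGA", "GNSGG"]
profile = build_profile(alignment)

THRESHOLD = 5.0  # hypothetical cut-off chosen by eye for this toy example
candidates = ["GDSGG", "GNSGA", "WWWWW"]
hit_list = [seq for seq in candidates if score(profile, seq) >= THRESHOLD]

# A rigid pattern built from the consensus rejects the divergent member,
# whereas the profile still scores it above the threshold.
pattern_hits = [seq for seq in candidates if re.fullmatch("GDSGG", seq)]

print("profile hit list:", hit_list)
print("pattern hit list:", pattern_hits)
```

In this sketch the divergent segment GNSGA never occurs in the alignment as a whole, yet each of its residues was observed at its position at least once, so it still accumulates a positive score, while the exact pattern only accepts the consensus.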
Useful tools and resources for protein family, domain and motif analysis
Blocks: Database of protein alignment blocks
CDD: Conserved domain database
CluSTr: Clusters of SWISS-PROT and TrEMBL proteins
DOMO: Protein-domain database based on sequence alignments
InterPro: Integrated documentation resource for protein families, domains and functional sites
iProClass: Integrated protein classification database
MetaFam: Database of protein family information
Pfam: Collection of multiple sequence alignments and hidden Markov models
PIR: Protein Information Resource
PIR-ALN: Curated database of protein sequence alignments
PRINTS-S: Compendium of protein fingerprints
ProClass: Non-redundant protein database organized by family relationships
ProDom: Automatic compilation of homologous domains
PROSITE: Database of patterns and profiles describing protein families and domains
ProtoMap: Automatic hierarchical classification of SWISS-PROT proteins
SBASE: Curated protein domain library based on sequence clustering
SMART: Simple Modular Architecture Research Tool - a collection of protein families and domains
SWISS-PROT and TrEMBL: Protein sequence databases
SYSTERS: Systematic re-searching method for sequence searching and clustering
TIGRFAMs: Protein families based on hidden Markov models
What is available?
PROSITE patterns and profiles
PROSITE [9,10] is a database of both patterns and profiles. PROSITE patterns are built from alignments of related sequences, which are taken from a variety of sources: from a well-characterized protein family; derived from the literature; from the results of sequence searches against SWISS-PROT and TrEMBL; or from sequence clustering. The alignments are checked for conserved regions, which, particularly for the characterized protein families, may have been experimentally shown to be involved in the catalytic activity or to bind a substrate. A core pattern is created in the form of a regular expression that specifies which amino acid(s) may or may not occur at each position. Regular expressions are text strings that describe patterns and so represent a set of strings. They resemble the wildcard pattern matching traditionally used by utilities on Unix and Unix-like operating systems, but are much more elaborate and powerful than standard wildcard expressions, and also much more complex. Once the core pattern is made, it is tested against the sequences in SWISS-PROT. If the correct set of proteins matches this pattern then it is kept; if it fails to pick up some family members or picks up too many unrelated proteins, the pattern is refined and re-tested until it is optimized.
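As an illustration of how such a pattern behaves as a regular expression, the following Python sketch translates the common elements of PROSITE pattern syntax into the syntax of Python's re module and scans a sequence with the result. The converter is a simplification written for this example, not PROSITE's own software: it handles only the basic elements (x, square and curly brackets, repeat counts and the two anchors) and ignores rarer cases such as an anchor inside square brackets. The pattern shown is the ATP/GTP-binding site motif A (P-loop); the two test sequences are short illustrative segments, the second carrying a single substitution.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern into a Python regular expression."""
    regex = pattern.rstrip(".")          # patterns end with a full stop
    regex = regex.replace("-", "")       # '-' only separates positions
    regex = regex.replace("x", ".")      # 'x' accepts any residue
    regex = regex.replace("{", "[^").replace("}", "]")  # {..}: forbidden set
    regex = regex.replace("(", "{").replace(")", "}")   # (n) or (n,m): repeats
    regex = regex.replace("<", "^").replace(">", "$")   # terminus anchors
    return regex


p_loop = prosite_to_regex("[AG]-x(4)-G-K-[ST].")
print(p_loop)

member = "MTEYKLVVVGAGGVGKSALT"    # Ras-like N-terminal segment
divergent = "MTEYKLVVVGAGGVGRSALT"  # same segment with K replaced by R

hit = re.search(p_loop, member)
miss = re.search(p_loop, divergent)
print(hit.group() if hit else None)
print(miss)
```

The second search returns no match: one substitution at a constrained position is enough for the pattern to miss the sequence, which is the limitation of patterns discussed earlier.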
Patterns have many advantages, but they also have their limitations across whole sequences, which is why PROSITE also creates profiles, to complement the patterns. For these, the process also starts with multiple sequence alignments; it then uses a symbol comparison table to convert residue frequency distributions into weights, resulting in a table of position-specific weights. A symbol comparison table comprises values describing the comparison between pairs of amino acids. The table has a value for the match quality of every possible pair of amino acids, and is used to provide scores for the probability of one amino acid being replaced by another at a particular position within the sequence alignment. These numbers are used to calculate a similarity score for the alignment between the profile and sequences in SWISS-PROT; an alignment with a similarity score equal to or greater than a given cut-off value constitutes a true hit. The profile is then refined until only the intended set of protein sequences scores above the threshold for the profile.
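The conversion of residue frequencies into weights can be sketched in Python as follows. This is a simplified illustration rather than the procedure PROSITE itself uses: the weight of a residue at a position is taken as the frequency-weighted average of its comparison scores against the residues observed in that alignment column, and the comparison table is a made-up three-level table (identical, same physicochemical group, different) standing in for a real symbol comparison table. All names and values are assumptions.

```python
from collections import Counter

# Hypothetical grouping of residues by broad physicochemical similarity.
GROUPS = ["AVLIM", "FWY", "ST", "NQ", "DE", "KRH", "G", "P", "C"]
AMINO_ACIDS = "".join(GROUPS)


def toy_comparison(a, b):
    """Toy symbol comparison table: 4 identical, 1 same group, -2 otherwise."""
    if a == b:
        return 4
    if any(a in group and b in group for group in GROUPS):
        return 1
    return -2


def column_weights(column):
    """Convert the residue frequencies of one column into per-residue weights."""
    counts = Counter(column)
    n = len(column)
    return {
        a: sum(counts[b] / n * toy_comparison(a, b) for b in counts)
        for a in AMINO_ACIDS
    }


conserved = column_weights("DDDD")  # column containing only aspartate
mixed = column_weights("DDDE")      # one glutamate among the aspartates

for residue in "DEK":
    print(residue, conserved[residue], mixed[residue])
```

Even in the fully conserved column, glutamate receives a higher weight than lysine although neither was observed, because the comparison table treats it as a likelier replacement for aspartate. This is how a profile can admit divergent family members that a pattern would reject.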
Pfam, SMART and TIGRFAMs HMMs
Many databases, such as Pfam, SMART and TIGRFAMs, use HMMs as a way of creating diagnostic signatures for protein families, domains and repeats. The HMMs are built from manually curated sequence alignments using the HMMER2 package, which is based on Bayesian statistical models. Pfam [13,14] is a collection of multiple protein-sequence alignments and HMMs, and provides a good repository of models for identifying protein families, domains and repeats. There are two parts to the Pfam database: PfamA, a set of manually curated and annotated models; and PfamB, which has higher coverage but is fully automated (with no manual curation). PfamB HMMs are created from alignments generated by ProDom [21,22] in their automatic clustering of the protein sequences in SWISS-PROT and TrEMBL.
The SMART database ('simple modular architecture research tool') [15,16] produces HMMs that facilitate the identification and annotation of genetically mobile domains and the analysis of domain architectures. The database is highly populated with models for domains found in signaling, extracellular and chromatin-associated proteins. The models rely on hand-curated multiple sequence alignments of representative family members, based on tertiary structures where possible but otherwise found by PSI-BLAST. Once the models are created, they are used to search the database for additional members to be included in the sequence alignment. This iterative process is repeated until no further homologs are detected. TIGRFAMs [17,18] creates HMMs that group homologous proteins that are conserved with respect to function. The models are produced in a similar way to those in Pfam and SMART, but should only hit equivalogs, proteins that have been shown to have the same function.
The PRINTS-S database [11,12] uses 'fingerprints' as diagnostic signatures, in a variation on the methods described above. A fingerprint is a group of conserved motifs used to characterize a protein family. Rather than focusing solely on small conserved areas, the occurrence of these conserved areas across the whole sequence is taken into account. Once again the starting point is a curated multiple sequence alignment. Profiles are built for small conserved regions in the sequence, and together these make up a fingerprint. The 'fingers', or motifs, are required to be present in the sequence in the correct order for the fingerprint to be counted as a match in a target sequence. During the creation of fingerprints each motif is used to scan the protein sequence database and the resulting hit lists are correlated, to add sequences to the original alignment. New motifs are then generated and the process is repeated until convergence. Recognition of individual elements in the fingerprint is mutually conditional, and true members match all elements in the correct order, while members of a subfamily may match only part of the fingerprint. Many fingerprints have been created to identify proteins at the superfamily as well as the family and subfamily levels; for this reason, many of the fingerprints are related to each other in an ordered hierarchical structure.
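The requirement that motifs occur in the correct order, and the possibility of partial matches, can be illustrated with a short Python sketch. It is a deliberate simplification: each motif is represented by a regular expression rather than a profile, the motifs and sequences are invented, and the left-to-right greedy scan shown here is an assumption of this example, not the PRINTS-S scoring method.

```python
import re


def match_fingerprint(motifs, sequence):
    """Return the indices of motifs found in order along the sequence.

    Each motif is searched for only downstream of the previous match, so a
    motif that occurs out of order does not count towards the fingerprint.
    """
    matched = []
    position = 0
    for index, motif in enumerate(motifs):
        hit = re.compile(motif).search(sequence, position)
        if hit:
            matched.append(index)
            position = hit.end()
    return matched


# Hypothetical three-motif fingerprint.
fingerprint = ["GDSGG", "CQGD", "WGEG"]

full_member = "AAGDSGGAAACQGDAAAWGEGAA"  # all three motifs, in order
out_of_order = "AAWGEGAAAGDSGGAA"        # third motif precedes the first

print(match_fingerprint(fingerprint, full_member))
print(match_fingerprint(fingerprint, out_of_order))
```

A sequence that matches every motif in order would be treated as a true family member, while one that matches only a subset is the kind of partial match that, as described above, may indicate membership of a subfamily.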
Clustering and alignment
An example of a database that solely uses sequence clustering and alignment methods is ProDom [21,22]. This database takes all proteins in the SWISS-PROT and TrEMBL protein databases, removes fragments, identifies the smallest remaining sequence and uses this as a query sequence to search the SWISS-PROT/TrEMBL protein database using PSI-BLAST. The hit-list sequences are made into a new ProDom domain family and removed from the protein database. The remaining sequences are once again sorted by size, and the smallest sequence is again used as a query sequence. This process is repeated until there are no more sequences in the protein database. In this way, ProDom groups all the non-fragment sequences in SWISS-PROT and TrEMBL into more than 150,000 families. Other major alignment databases are: PIR-ALN [26,27], which is a database of annotated protein sequence alignments derived automatically from the PIR sequence database and has alignments at the superfamily and domain levels; and ProtoMap [30,31], an automatic classification of all SWISS-PROT and TrEMBL proteins into groups of related proteins based on pair-wise similarities.
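The iterative procedure described for ProDom can be outlined in Python as below. This is only a sketch of the loop as summarized in this article: the similarity search is a placeholder function supplied by the caller (here a trivial substring test standing in for PSI-BLAST), the input is assumed to be free of fragments already, and the sequences and identifiers are invented.

```python
def cluster_by_shortest_query(sequences, is_similar):
    """Group sequences into families using the shortest-query-first loop.

    sequences  -- dict mapping identifier to sequence (fragments removed)
    is_similar -- callable(query, candidate) -> bool; stand-in for a real
                  similarity search such as PSI-BLAST
    """
    remaining = dict(sequences)
    families = []
    while remaining:
        # Pick the smallest remaining sequence (ties broken by identifier).
        query_id = min(remaining, key=lambda sid: (len(remaining[sid]), sid))
        query = remaining[query_id]
        # The query plus everything it hits becomes a new family.
        family = sorted(
            sid for sid, seq in remaining.items()
            if sid == query_id or is_similar(query, seq)
        )
        families.append(family)
        # Remove the family from the database before the next round.
        for sid in family:
            del remaining[sid]
    return families


def contains_query(query, candidate):
    """Toy similarity test: the query occurs verbatim in the candidate."""
    return query in candidate


toy_database = {
    "p1": "ACDEFG",
    "p2": "MKACDEFGHH",
    "p3": "WWYYWW",
    "p4": "LLWWYYWWLL",
    "p5": "QQQQ",
}
families = cluster_by_shortest_query(toy_database, contains_query)
print(families)
```

Because every round removes at least the query itself, the loop always terminates, and every sequence ends up in exactly one family.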
Another protein signature database worth a mention is Blocks [19,20], a collection of multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. These alignments are represented as profiles, built up using a tool called PROTOMAT. The profiles are calibrated against the SWISS-PROT database, and the LAMA software tool is used to search new blocks against existing blocks. Blocks+ is a new, extended version of the original Blocks database.
Which one(s) should you use?
The benefits of integration
The integrated resources for protein family and domain signature databases, such as InterPro, MetaFam and CDD, have several uses, not only for the scientific community, for whom they build on the individual strengths of the different methods, but also for the member databases themselves. The integration reduces duplication of effort for the member databases in the labor-intensive, rate-limiting process of annotation, and also facilitates communication between the disparate resources. The integrated resources provide quality control mechanisms for assessing individual methods, and also highlight the areas where all the member databases are lacking in representation. This situation is improved by the increasing availability of complete genome sequences, which help to identify uncharacterized protein families that may be unique to single or groups of related organisms.
It is evident that there are currently a large number of high-quality protein signature databases and integrated databases available for automatic and large-scale protein classification. The challenge remains, however, in the transfer of useful biological knowledge to protein sequences. Automatic methods may provide some useful suggestions of protein architecture or function, but only a biologist can truly assign function to a protein, using these results, and the ultimate confirmation of these assignments must remain experimental evidence.
1. Ponting CP: Issues in predicting protein function from sequence. Brief Bioinform. 2001, 2: 19-29.
2. Bairoch A, Apweiler R: The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 2000, 28: 45-48. 10.1093/nar/28.1.45.
3. SWISS-PROT and TrEMBL. [http://www.expasy.ch/sprot/sprot-top.html]
4. Barker WC, Garavelli JS, Hou Z, Huang H, Ledley RS, McGarvey PB, Mewes HW, Orcutt BC, Pfeiffer F, Tsugita A, et al: Protein Information Resource: a community resource for expert annotation of protein data. Nucleic Acids Res. 2001, 29: 29-32. 10.1093/nar/29.1.29.
5. Protein Information Resource. [http://pir.georgetown.edu/]
6. Burks C, Cassidy M, Cinkosky MJ, Cumella KE, Gilna P, Hayden JE, Keen GM, Kelley TA, Kelly M, Kristofferson D, et al: GenBank. Nucleic Acids Res. 1991, Suppl 19: 2221-2225.
7. GenBank. [http://www.ncbi.nlm.nih.gov/Genbank/]
8. Gribskov M, Luthy R, Eisenberg D: Profile analysis. Methods Enzymol. 1990, 183: 146-159.
9. Hofmann K, Bucher P, Falquet L, Bairoch A: The PROSITE database, its status in 1999. Nucleic Acids Res. 1999, 27: 215-219. 10.1093/nar/27.1.215.
10. PROSITE. [http://www.expasy.ch/prosite/]
11. Attwood TK, Croning MDR, Flower DR, Lewis AP, Mabey JE, Scordis P, Selley JN, Wright W: PRINTS-S: the database formerly known as PRINTS. Nucleic Acids Res. 2000, 28: 225-227. 10.1093/nar/28.1.225.
12. PRINTS-S. [http://bioinf.man.ac.uk/dbbrowser/sprint/printss_lis.html]
13. Bateman A, Birney E, Durbin R, Eddy SR, Howe KL, Sonnhammer ELL: The Pfam Protein Families Database. Nucleic Acids Res. 2000, 28: 263-266. 10.1093/nar/28.1.263.
14. Pfam. [http://www.sanger.ac.uk/Software/Pfam/]
15. Ponting CP, Schultz J, Milpetz F, Bork P: SMART: identification and annotation of domains from signalling and extracellular protein sequences. Nucleic Acids Res. 1999, 27: 229-232. 10.1093/nar/27.1.229.
16. SMART. [http://smart.embl-heidelberg.de/]
17. Haft DH, Loftus BJ, Richardson DL, Yang F, Eisen JA, Paulsen IT, White O: TIGRFAMs: a protein family resource for the functional identification of proteins. Nucleic Acids Res. 2001, 29: 41-43. 10.1093/nar/29.1.41.
18. TIGRFAMs. [http://www.tigr.org/TIGRFAMs/]
19. Henikoff JG, Greene EA, Pietrokovski S, Henikoff S: Increased coverage of protein families with the blocks database servers. Nucleic Acids Res. 2000, 28: 228-230. 10.1093/nar/28.1.228.
20. Blocks. [http://blocks.fhcrc.org/]
21. Corpet F, Servant F, Gouzy J, Kahn D: ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res. 2000, 28: 267-269. 10.1093/nar/28.1.267.
22. ProDom. [http://prodes.toulouse.inra.fr/prodom/doc/prodom.html]
23. Gracy J, Argos P: Automated protein database classification: I. Integration of compositional similarity search, local similarity search and multiple sequence alignment. Bioinformatics. 1998, 14: 164-173. 10.1093/bioinformatics/14.2.164.
24. Gracy J, Argos P: Automated protein database classification: II. Delineation of domain boundaries from sequence similarities. Bioinformatics. 1998, 14: 174-187. 10.1093/bioinformatics/14.2.174.
25. DOMO. [http://www.infobiogen.fr/services/domo/]
26. Srinivasarao GY, Yeh LS, Marzec CR, Orcutt BC, Barker WC: PIR-ALN: a database of protein sequence alignments. Bioinformatics. 1999, 15: 382-390. 10.1093/bioinformatics/15.5.382.
27. PIR-ALN. [http://www-nbrf.georgetown.edu/pirwww/dbinfo/piraln.html]
28. Huang H, Xiao C, Wu CH: ProClass protein family database. Nucleic Acids Res. 2000, 28: 273-276. 10.1093/nar/28.1.273.
29. ProClass. [http://pir.georgetown.edu/gfserver/proclass.html]
30. Yona G, Linial N, Linial M: ProtoMap: automatic classification of protein sequences and hierarchy of protein families. Nucleic Acids Res. 2000, 28: 49-55. 10.1093/nar/28.1.49.
31. ProtoMap. [http://www.protomap.cs.huji.ac.il/]
32. Krause A, Stoye J, Vingron M: The SYSTERS protein sequence cluster set. Nucleic Acids Res. 2000, 28: 270-272. 10.1093/nar/28.1.270.
33. SYSTERS. [http://systers.molgen.mpg.de/]
34. Kriventseva EV, Fleischmann W, Zdobnov EM, Apweiler R: CluSTr: a database of clusters of SWISS-PROT+TrEMBL proteins. Nucleic Acids Res. 2001, 29: 33-36. 10.1093/nar/29.1.33.
35. CluSTr. [http://www.ebi.ac.uk/clustr/]
36. Bucher P, Karplus K, Moeri N, Hofmann K: A flexible motif search technique based on generalized profiles. Comput Chem. 1996, 20: 3-23. 10.1016/S0097-8485(96)80003-9.
37. Durbin R, Eddy S, Krogh A, Mitchison G: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge University Press; 1998.
38. HMMER2: Profile hidden Markov models for biological sequence analysis. [http://hmmer.wustl.edu/]
39. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
40. Gouzy J, Corpet F, Kahn D: Whole genome protein domain analysis using a new method for domain clustering. Comput Chem. 1999, 23: 333-340. 10.1016/S0097-8485(99)00011-X.
41. Henikoff S, Henikoff JG: Automated assembly of protein blocks for database searching. Nucleic Acids Res. 1991, 19: 6565-6572.
42. Pietrokovski S: Searching databases of conserved sequence regions by aligning protein multiple-alignments. Nucleic Acids Res. 1996, 24: 3836-3845. 10.1093/nar/24.19.3836.
43. Silverstein KA, Shoop E, Johnson JE, Retzel EF: MetaFam: a unified classification of protein families. I. Overview and statistics. Bioinformatics. 2001, 17: 249-261. 10.1093/bioinformatics/17.3.249.
44. MetaFam. [http://metafam.ahc.umn.edu/]
45. Wu CH, Xiao C, Hou Z, Huang H, Barker WC: iProClass: an integrated, comprehensive and annotated protein classification database. Nucleic Acids Res. 2001, 29: 52-54. 10.1093/nar/29.1.52.
46. iProClass. [http://pir.georgetown.edu/iproclass/]
47. Wheeler DL, Church DM, Lash AE, Leipe DD, Madden TL, Pontius JU, Schuler GD, Schriml LM, Tatusova TA, Wagner L, Rapp BA: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2001, 29: 11-16. 10.1093/nar/29.1.11.
48. CDD: A Conserved Domain Database and Search Service. [http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml]
49. Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, Biswas M, Bucher P, Cerutti L, Corpet F, Croning MDR, et al: The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res. 2001, 29: 37-40. 10.1093/nar/29.1.37.
50. InterPro. [http://www.ebi.ac.uk/interpro/]
51. Murvai J, Vlahovicek K, Barta E, Pongor S: The SBASE protein domain library, release 8.0: a collection of annotated protein sequence segments. Nucleic Acids Res. 2001, 29: 58-60. 10.1093/nar/29.1.58.
52. SBASE. [http://www3.icgeb.trieste.it/~sbasesrv/]
53. Jalview - a java multiple alignment editor. [http://www.ebi.ac.uk/~michele/jalview/]
54. Corpet F: Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res. 1988, 16: 10881-10890.
55. Zdobnov EM, Apweiler R: InterProScan - an integration platform for the signature-recognition methods in InterPro. Bioinformatics. 2001, 17: 847-848. 10.1093/bioinformatics/17.9.847.