THoR: a tool for domain discovery and curation of multiple alignments
© Dickens and Ponting; licensee BioMed Central Ltd. 2003
Received: 20 May 2003
Accepted: 25 June 2003
Published: 23 July 2003
We describe a tool, THoR, that automatically creates and curates multiple sequence alignments representing protein domains. This exploits both PSI-BLAST and HMMER algorithms and provides an accurate and comprehensive alignment for any domain family. The entire process is designed for use via a web-browser, with simple links and cross-references to relevant information, to assist the assessment of biological significance. THoR has been benchmarked for accuracy using the SMART and pufferfish genome databases.
Data emanating from the genome-sequencing projects are flooding sequence databases. Without informative annotation, these sequences will not achieve their full potential of facilitating directed experimental research. One approach to annotation is to use efficient and automatic procedures to associate predictions of evolution, structure and function to genes. When annotating sequences from an evolutionary perspective, it is important to characterize sequences in terms of domains, which we define as being compact and spatially distinct protein structures . As domains are often present in different molecular contexts and in different combinations, it is necessary to employ sophisticated resources such as Pfam  and SMART  for their prediction. Each of these is a database of multiple sequence alignments, and associated hidden Markov models (HMMs) , and each exploits the HMMER package (profile HMM software for protein sequence analysis) . A problem for SMART is that the rapid expansion of protein-sequence databases has the consequence that multiple alignments are unable to be updated synchronously with newly deposited sequences. Here we describe THoR (the Thorough Homology Resource), a new tool that addresses this problem by automatically curating and updating multiple alignments for large domain families.
Identification of domain homologs for inclusion in SMART multiple sequence alignments  typically uses a combination of two algorithms, PSI-BLAST (position-specific iterative BLAST)  and HMMER . PSI-BLAST is a method of choice for identification of homologs with divergent sequences . In this method a query sequence is compared with a sequence database and candidate homologs are detected with Expect (E) values less than a threshold value (Eψ). E(x) is defined as the number of alignments expected in the database search with scores x or higher purely by chance. In subsequent iterations the database is searched using a position-specific score matrix that is calculated from a multiple sequence alignment of all the homologs detected in previous search rounds.
The main advantage of PSI-BLAST over HMMER is its speed. One of its major disadvantages is that it generates local alignments, rather than the global alignments that are required to delineate full-length domains. Thus, BLAST-derived local alignments often need to be extended in both directions to span intact domains. Furthermore, its searches are not symmetrical. If a sequence X is used to query a database and a sequence Y is predicted to be its homolog, then X is not necessarily always found to be a homolog when Y is used to search the same database . A final disadvantage is that multiple PSI-BLAST searches can generate a significant number of false-positive predictions. As discussed by Jones and Swindells , the results of multiple PSI-BLAST searches can provide error rates as high as one false positive in 13.2% of searches when these are taken to five rounds.
In contrast to PSI-BLAST, HMMER is able to generate accurate global alignments. The HMMER hmmsearch method uses a hidden Markov model (HMM), previously calculated from a multiple sequence alignment, to do a global search of a sequence database for candidate homologs, taken to be sequences detected with E values less than a threshold, EHMM. As a result of its heuristics, HMMER is often a better tool for the detection of short repeats or domains than PSI-BLAST.
Exhaustive PSI-BLAST searches that use multiple individual sequences as queries, have been shown to detect a high proportion of homologues in databases [8, 9]. However, the task of manually cross-referencing these multiple PSI-BLAST results to extract all predicted candidate homologs is extremely onerous and time-consuming. It is made even more difficult for populous domain families and for domains that are repeated often within single proteins by the large amount of data involved in the analyses. The system for easy analysis of lots of sequences (SEALS) package  eases this task by providing PSI-BLAST output parsers linked to sequence retrieval. However, SEALS can be difficult to install and cannot accurately retrieve full-length domain sequences from BLAST-derived local alignments.
The software package described here takes advantage of the speed and sensitivity of PSI-BLAST, together with the global alignment benefit of HMMER. This 'thorough homology resource' (THoR) provides a convenient web-based interface to the cross-referenced results of exhaustive PSI-BLAST database searches. The package takes a multiple sequence alignment as input and generates an updated and extended global alignment of all domain homologs, defined as those predicted by both the PSI-BLAST and HMMER methods. It also provides the benefits of convenient access to the search results and cross-referencing against the National Center for Biotechnology Information (NCBI) protein sequence and taxonomy databases.
Subsequently the multiple alignment A n is used to generate an HMM n . High-scoring global alignments against HMM n are then calculated for all sequences in DH and all HMMER HSPs (HSP-H) with E-values less than a threshold value (EHMM) are collected (Figure 1).
The domain homologs predicted by this protocol are defined as the intersection between the HMMER aligning regions, HMMER high-scoring pairs (HSP-H), and the PSI-BLAST HSPs (HSP-ψ). An HSP-H (with initial and final amino-acid coordinates A i , A j ) is considered to be equivalent to an HSP-ψ (with initial and final amino-acid coordinates B i , B j ) if:
where, by default, n = 60%, although this can be supplied by the user. However, values of n < 50% are more suitable for the identification of motifs, rather than domains.
Typically, Eψ <<EHMM, as these values are determined using different approaches and because PSI-BLAST is the method of choice for homolog detection. Eψ = 5 × 10-3 and EHMM = 1 are recommended values used in THoR for comparison against current protein-sequence databases. Higher values of Eψ result in increased false-positive rates, whereas higher EHMM values generate less accurate global alignments, albeit of likely true homologs predicted by PSI-BLAST.
All domain homologs identified using this approach are then realigned against the original sequence alignment A n , using hmmalign (Figure 1) with the model HMM n . There is a series of intermediate iterations of this process, where an HMM, constructed from the new alignment An,i, is then used to re-search the database of PSI-BLAST results. Where new domain homologs are revealed, that satisfy both EHMM and Eψ thresholds, these are appended to the alignment in an iterative manner until convergence, resulting in a revised multiple sequence alignment An+1. The entire process can then be repeated using An+1 as the input alignment for the subsequent run of THoR. In order to reduce redundancy, an option is provided to purge the alignment of one of a pair of sequences that are greater than a threshold percentage identity. At this point it is advisable that the user assesses the quality of the alignment An+1 for minor misalignments and fragments resulting from gene mispredictions or incomplete protein sequences. Advice for the construction and manual editing of multiple sequence alignments is given by Bork and Gibson , and Ponting and Birney .
The THoR analyses are optimized so that for any subsequent runs of THoR for the given alignment (An+1), only those sequences that are novel when compared with the previous alignment (A n ) will be subjected to PSI-BLAST searches and only results from these searches will be appended to those emanating from the previous runs of THoR.
The following benchmarking tests were used to investigate the performance of THoR.
Domain homolog identification
An HMM can be used to find domain family homologs using the HMMER program hmmsearch. This method has been used extensively for the annotation of new genomes [14, 15] and as such provides an excellent benchmark against which the success of THoR can be assessed. For the purposes of comparison, a set of 50 domain family multiple sequence alignments were selected from the SMART domain database. To prevent any domain-based bias in the searches the domains were chosen in such a way that there were representatives of all the different domain types; that is enzymes and non-enzymes, large and small in size, found broadly and narrowly in diverse taxa, with many and few homologs, and domains found in various cellular compartments (nuclear, cytoplasmic and secreted). Among the domains found in narrowly diverse taxa were domains that were not expected to be found within the pufferfish genome; these served as negative controls.
The pufferfish (Takifugu rubripes) genome  was chosen for benchmarking the THoR method for two reasons. First, this would substantially reduce any sequence-based bias in the searches, as there are few alignments in SMART that contain pufferfish sequences. If the human genome had been chosen instead, a bias in the searches would have arisen from the high representation of human and mammalian sequences in SMART alignments. Second, there are relatively few pufferfish sequences represented in the NCBI nonredundant (nr) database that will be part of the THoR search process. This maintains the nonredundancy of the database, which is important for the efficiency of both the PSI-BLAST and HMMER searches.
Typical timings for a complete round of THoR against the nr plus pufferfish database were taken for domains of different sizes. Two other factors that increase the search speed are the size of the target database and the frequency of occurrence of the domain within that database, simply because domains that occur more frequently produce more results that are analyzed and compared by the THoR program.
Stability of the search process
Domains that are highly represented within the database that is searched by THoR produce large files of results to be analyzed. Often a single domain family search produces gigabytes of PSI-BLAST results. Consequently it is essential that all searches are completed without encountering memory problems. All 50 of the benchmarking domains completed without these errors and successfully generated a subsequent alignment. However, the 'ATPases associated with diverse cellular activities' (AAA) domain family produced an alignment with 14,279 members which could not generate an HMM owing to a limitation in hmmbuild on the benchmarking machine. The WD40 repeat also encountered similar problems. One way to circumvent this limitation is described in the discussion.
For each domain family, lists of the results that were found only by THoR and those that were found only by hmmsearch were generated and then each item in these lists was examined manually in order to discover false-positive and false-negative results. This analysis combined PSI-BLAST and subsequent hmmsearch searches using the closest homologs for each hit as queries. Examination of the results that were unique to THoR revealed that only one putative false positive was present among the 50 resulting alignments. The only cases where hmmsearch identified homologs that THoR did not were domain alignments less than 40 amino acids in length (see Results and discussion).
Results and discussion
A comparison of the number of pufferfish hits by hmmsearch results versus the pufferfish database both before and after the THoR process
Domain name (SMART name)
N(THoR) - N(SMART)
14-3-3 homologs (14_3_3)
Domains in Ataxins and HMG-containing proteins (AXH)
Breast cancer carboxy-terminal domain (BRCT)
Bromo domain (BROMO)
Bulb-type mannose-specific lectins (B_lectin)
Chromatin organization modifier domain (CHROMO)
Calpain-like thiol protease family (CysPc)
Tandem repeat (DM15)
Receptor for Ubiquitination Targets (FBOX)
Formin homology 2 domain (FH2)
Fibronectin type 1 domain (FN1)
High mobility group (HMG)
Protein kinase C-related kinase homology region 1 homologs (HR1)
Short calmodulin-binding motif containing conserved Ile and Gln residues (IQ)
Kyprides, Ouzounis, Woese motif (KOW)
Zinc-binding domain present in Lin-11, Isl-1, Mec-3 (LIM)
Pleckstrin homology (PH)
Zinc finger (PHD)
Phosphoinositide 3-kinase, region postulated to contain C2 domain (PI3K_C2)
Motif in proteasome subunits, Int-6, Nip-1 and TRIP-15 (PINT)
Phosphatidylinositol phosphate kinases (PIPKc)
Domain found in a protein subunit of human RNase MRP and RNase P ribonucleoprotein complexes and archaeal proteins (POP4)
Domain found in Plexins, Semaphorins and Integrins (PSI)
Domain with conserved PWWP motif (PWWP)
Guanine nucleotide exchange factor for Rho/Rac/Cdc42-like GTPases (RhoGEF)
Src homology 2 domains (SH2)
Src homology 3 domains (SH3)
Staphylococcal nuclease homologs (SNc)
Domain in short gastrulation protein and chordin (SOG)
snRNP Sm proteins (Sm)
Tetratricopeptide repeats (TPR)
Tudor domain (TUDOR)
Domain present in VPS-27, Hrs and STAM (VHS)
Only one candidate false positive was predicted in our benchmarking test. Out of a total of 3,519 domain homologs found by THoR, one sequence (SINFRUP00000075480 amino acids 159 to 242) could not be validated independently as a PINT domain through the use of PSI-BLAST and SMART. Similarly, it could not be proved that this sequence was not a homolog. The alignment of this sequence and SMART PINT domains is provided as additional data (see Additional data file).
Comparison of hmmsearch results and THoR results revealed false negatives, defined as domains identified using the SMART hmmsearch but not found by THoR, only when the starting alignment was shorter that 35 to 40 amino acids (Table 1). Short repeats are often difficult to identify using PSI-BLAST because of its inappropriate use of optimal alignment statistics for suboptimal alignments  as well as the general problem of low alignment scores. As THoR is reliant on PSI-BLAST for homolog detection, it is not surprising that it shares its limitations. Consequently, THoR is only appropriate for domains or repeats with lengths exceeding 40 amino acids. For short repeats or domains hmmsearch would prove to be a better tool for domain homolog identification.
The caveats of short alignments and large sets of results highlight two principles that must be applied in order to gain the greatest benefit from the use of THoR. First, the quality of the starting alignment greatly affects the quality of the results that are received, which is also the case for hmmsearch searches. Results from alignments containing many sequences require a long time to complete because there are more PSI-BLAST searches to perform and because these require significantly more HSP-H to HSP-ψ comparisons. The extra information gained from having more members in the starting alignment is, in most cases, offset by the extra time required for calculations. Therefore, it is suggested that in most cases an alignment that is nonredundant at a level of 50-80% sequence identity will be sufficient to produce an improved set of results. This approach reduces the likelihood of encountering memory errors, as for the cases of the AAA domain and WD40 repeat. Similarly, it is important that the target database is as nonredundant as possible . When analyzing domains that are known to have large numbers of homologs within the database (such as AAA) it may even be necessary to use a database that is nonredundant at a level of 50% identity, in order to generate results that are of a size that can be processed within a reasonable period of time.
The THoR package provides an easy-to-use interface to both of the search methods and the results are provided in a simple, intuitive format. The web interface has been designed to provide easy access to as much information as possible, including annotation and taxon information. In order to facilitate the evaluation of large amounts of data, all the results for each entry in the new alignment are displayed in a condensed format, which contains the essential information such as sequence identifiers and HSP-H and HSP-ψ locations. Access to the complete dataset, such as all of the PSI-BLAST results for each entry and all of the queries that identified that sequence, is through a series of links that either open hidden sections of the page for viewing or open a new, smaller window. This is a very effective method for data management, which provides straightforward and transparent quality control of sequence information.
Materials and methods
THoR uses PSI-BLAST and HMMER algorithms and Perl scripts, integrated within a PHP-based interface supplied through the Apache web-server . Although the THoR package has been designed to operate on the Linux platform, it is likely to run on most Unix-based operating systems and possibly on a Windows-based system with some minor modifications.
Locations from which the programs and files required by THoR can be downloaded
The NCBI nr database
The THoR package is provided with installation instructions and Perl installation scripts on the Oxford SMART website . THoR will run on a Linux box capable of supporting the software and at least a 10 Gb hard disk drive. The whole package has been successfully installed and run on a single-processor 300 MHz machine with 128 Mb RAM. However, as the PSI-BLAST searches are computer-intensive and the results can be large, it is recommended that the package is installed on a dual-processor Linux machine with at least 512 Mb RAM and 40 Gb of hard disk storage space.
Additional data files
The alignment of SMART PINT domains is available (Additional data file 1).
We would like to thank Leo Goodstadt for helpful discussions concerning the optimization of Perl code.
- Ponting CP, Russell RR: The natural history of protein domains. Annu Rev Biophys Biomol Struct. 2002, 31: 45-71. 10.1146/annurev.biophys.31.082901.134314.PubMedView ArticleGoogle Scholar
- Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, Griffiths-Jones S, Howe KL, Marshall M, Sonnhammer EL: The Pfam protein families database. Nucleic Acids Res. 2002, 30: 276-280. 10.1093/nar/30.1.276.PubMedPubMed CentralView ArticleGoogle Scholar
- Letunic I, Goodstadt L, Dickens NJ, Doerks T, Schultz J, Mott R, Ciccarelli F, Copley RR, Ponting CP, Bork P: Recent improvements to the SMART domain-based sequence annotation resource. Nucleic Acids Res. 2002, 30: 242-244. 10.1093/nar/30.1.242.PubMedPubMed CentralView ArticleGoogle Scholar
- Hofmann K: Sensitive protein comparisons with profiles and hidden Markov models. Brief Bioinform. 2000, 1: 167-178.PubMedView ArticleGoogle Scholar
- Eddy SR: Profile hidden Markov models. Bioinformatics. 1998, 14: 755-763. 10.1093/bioinformatics/14.9.755.PubMedView ArticleGoogle Scholar
- Ponting CP, Schultz J, Copley RR, Andrade MA, Bork P: Evolution of domain families. Adv Protein Chem. 2000, 54: 185-244.PubMedView ArticleGoogle Scholar
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.PubMedPubMed CentralView ArticleGoogle Scholar
- Park J, Karplus K, Barrett C, Hughey R, Haussler D, Hubbard T, Chothia C: Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J Mol Biol. 1998, 284: 1201-1210. 10.1006/jmbi.1998.2221.PubMedView ArticleGoogle Scholar
- Salamov AA, Suwa M, Orengo CA, Swindells MB: Combining sensitive database searches with multiple intermediates to detect distant homologues. Protein Eng. 1999, 12: 95-100. 10.1093/protein/12.2.95.PubMedView ArticleGoogle Scholar
- Jones DT, Swindells MB: Getting the most from PSI-BLAST. Trends Biochem Sci. 2002, 27: 161-164. 10.1016/S0968-0004(01)02039-4.PubMedView ArticleGoogle Scholar
- Walker DR, Koonin EV: SEALS: a system for easy analysis of lots of sequences. Proc Int Conf Intell Syst Mol Biol. 1997, 5: 333-339.PubMedGoogle Scholar
- Bork P, Gibson TJ: Applying motif and profile searches. Methods Enzymol. 1996, 266: 162-184.PubMedView ArticleGoogle Scholar
- Ponting CP, Birney E: Identification of domains from protein sequences. Methods Mol Biol. 2000, 143: 53-69. 10.1385/1-59259-368-2:53.PubMedGoogle Scholar
- Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, et al: Initial sequencing and analysis of the human genome. Nature. 2001, 409: 860-921. 10.1038/35057062.PubMedView ArticleGoogle Scholar
- Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, et al: Initial sequencing and comparative analysis of the mouse genome. Nature. 2002, 420: 520-562. 10.1038/nature01262.PubMedView ArticleGoogle Scholar
- Aparicio S, Chapman J, Stupka E, Putnam N, Chia JM, Dehal P, Christoffels A, Rash S, Hoon S, Smit A, et al: Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science. 2002, 297: 1301-1310. 10.1126/science.1072104.PubMedView ArticleGoogle Scholar
- Andrade MA, Ponting CP, Gibson TJ, Bork P: Homology-based method for identification of protein repeats using statistical significance estimates. J Mol Biol. 2000, 298: 521-537. 10.1006/jmbi.2000.3684.PubMedView ArticleGoogle Scholar
- Holm L, Sander C: Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics. 1998, 14: 423-429. 10.1093/bioinformatics/14.5.423.PubMedView ArticleGoogle Scholar
- The Apache Software Foundation. [http://www.apache.org]
- Goodstadt L, Ponting CP: CHROMA: consensus-based colouring of multiple alignments for publication. Bioinformatics. 2001, 17: 845-846. 10.1093/bioinformatics/17.9.845.PubMedView ArticleGoogle Scholar
- The Free Software Foundation. [http://www.gnu.org]
- Thorough Homology Resource. [http://smart.ox.ac.uk/thor]
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.