CPSARST: an efficient circular permutation search tool applied to the detection of novel protein structural relationships
© Lo et al.; licensee BioMed Central Ltd. 2008
Received: 11 September 2007
Accepted: 18 January 2008
Published: 18 January 2008
Circular permutation of a protein can be visualized as if the original amino- and carboxyl termini were linked and new ones created elsewhere. It has been well-documented that circular permutants usually retain native structures and biological functions. Here we report CPSARST (Circular Permutation Search Aided by Ramachandran Sequential Transformation) to be an efficient database search tool. In this post-genomics era, when the amount of protein structural data is increasing exponentially, it provides a new way to rapidly detect novel relationships among proteins.
Circular permutation (CP) in a protein structure is the rearrangement of the amino acid sequence such that the amino- and carboxy-terminal regions are interchanged [1, 2]. It can be visualized as if the original termini of the polypeptide were linked and new ones created elsewhere [3, 4]. Since the first observation of naturally occurring circular permutations in plant lectins , a substantial number of natural examples have been reported, including some bacterial β-glucanases, swaposins, glucosyltransferases, β-glucosidases, SLH domains, transaldolases, C2 domains (for a review, see ), FMN-binding proteins , double-φ β-barrels , glutathione synthetases , DNA and other methyltransferases [1, 10], ferredoxins , and proteinase inhibitors [12, 13]. In most of the cases, circular permutants (CPs) have conserved function or enzymatic activity [6, 14], sometimes with increased functional diversity [15–17].
To reveal the influences of CP on the structure, function and folding mechanism of proteins, many artificial CPs have been generated, inclusive of trypsin inhibitor, anthranilate isomerase, dihydrofolate reductase, T4 lysozyme, ribonucleases, aspartate transcarbamoylase, the α-spectrin SH3 domain, the Escherichia coli DsbA protein, ribosomal protein S6 and Bacillus β-glucanase [18, 19]. The outcomes have indicated that three-dimensional structure seems remarkably insensitive to CP  and CPs generally retain their biological functions [3, 4], although the structural stabilities, the folding nuclei, transition states or pathways might be altered [18, 20, 21]. Since CP generally preserves protein structure and function, with sometimes increased stability or activity, it has been applied to trigger crystallization , improve enzyme activities , determine critical elements [23, 24], and create novel fusion proteins, the tethered sites of which are not confined to the native termini [25–28], such as the famous fluorescent calcium sensor .
In spite of these interesting properties and applications, there is still much uncertainty about the genetic mechanisms, the evolutionary importance and the natural prevalence of CP [6, 18, 29, 30]. CPs can arise from posttranslational modifications [5, 31] but a majority may arise from genetic events . There have been several genetic and evolutionary mechanisms proposed, for instance, duplication/deletion models [6, 32], duplication-by-permutation models [1, 33], fusion/fission models [2, 30], and plasmid-mediated 'cut and paste' . However, which plays the major role or what proportion each mechanism contributes to the evolution of CPs and protein families remains uncertain. Besides, because of the disagreement between definitions of CPs, conflicting conclusions can be observed. In general, previous studies that considered the whole protein as the unit that undergoes CP concluded that CP is rare in nature [6, 14, 30] while those viewing the domain as the unit that undergoes CP suggested CP to be frequent [1, 29, 34].
In this post-genomic era, the amount of protein structure data is increasing exponentially, and plenty of information should be extractable to reveal the natural prevalence and evolutionary mechanism of CP; however, CP search tools are still very rare. It has been indicated that traditional sequence comparison methods are linearly sequential in nature and inefficient at identifying CP [6, 35]. Three-dimensional structural comparisons may identify more evolutionarily far-related CPs ; nevertheless, conventional methods such as DALI  and CE  are also inefficient due to their sequential nature . To detect CP, the most exact approach is to use an algorithm that generates all possible CPs of one protein and subsequently aligns them with another protein to find an alignment better than the linear alignment [2, 38], although this is apparently very time-consuming. A few brilliant approaches have been developed to achieve higher efficiency. Uliel et al. [30, 38] proposed a heuristic method based on duplicating one of the two protein sequences followed by manual verifications. Though being much faster, it still takes several CPU months to survey tens of thousands of sequences. The requirement of manual examinations also makes it unrealistic for searching large datasets . Weiner et al.  condensed amino acid sequences into tiny domain strings to achieve an extremely high speed, scanning hundreds of thousands of sequences in hours; however, without suitable domain annotations or when a CP disrupts a domain, false negatives occur. Structural alignment methods applicable to the identification of CPs have also been developed. For instance, Jung and Lee  developed SHEBA to screen the SCOP database. They suggested that CPs are very frequent and many have symmetric structures. However, since internal symmetry may introduce noise into the detection of CPs , certain false positive predictions can be produced. Regardless of the capability of detecting distantly related CPs, a pair-wise comparison by structure-based CP-detecting algorithms may take from seconds to minutes , making routine database searches infeasible.
Overview of CPSARST
Performance on random circular permutants
The curve of the percentage distance of CPSARST has a half hyperbolic shape (Figure 2b). Provided that the sequence identity is > 20%, the percentage distance will be < 1%. Combining these data, we suggest that when our approach is applied to amino acid sequences, it will be reliable in detecting CPs with sequence identities as low as about 20%.
Accuracy evaluations with engineered circular permutants
Retrieved parent proteins of engineered CPs by CPSARST
Parent structure/recorded CP site
Retrieved structure/determined CP site
Circularly permuted (1-3,1-4)-beta-D-glucan 4-glucanohydrolase H
Circularly permuted (1-3,1-4)-beta-D-glucan 4-glucanohydrolase H
Circularly permuted BB2-crystallin
Spectrin alpha chain
Circularly permuted core-streptavidin E51/A46
Circularly permuted core-streptavidin E51/A46
Thiol-disulfide interchange protein
Pair-wise comparisons of naturally occurring circular permutants
Performance of pair-wise comparisons for natural candidate CP pairs over various protein sizes
Length of the query protein (residues)
No. of candidate CP pairs
Average running time (s)
Average running time (s)
Average running time (s)
Performance of pair-wise comparisons for natural candidate CP pairs over various sequence identities
No. of candidate CP pairs
CPSARST runs very rapidly in pair-wise comparisons. When searching databases, its speed will be even higher since it does not work in a pair-wise manner but with a 'double filter-and-refine' strategy. Chen had estimated that using SAMO to compare two proteins mostly took around ten seconds . Searching the current PDB (approximately 90,000 polypeptides) by one-against-all comparisons will, therefore, require over 15,000 minutes. However, CPSARST can do this one-against-all comparison in 1.7 minutes (see below). As shown by these naturally occurring cases, CPSARST achieves a high speed with a reasonable compromise in alignment accuracy.
Protein structural database searches
Statistics of protein structural database searches
No. of proteins
No. of candidate pairs
Detected by amino acid sequence
Detected only by Ramachandran string
Confirmed after the refinement stage
Total no. of protein pairs
208.0 × 106
136.6 × 106
Total running time (minutes)
No. of protein pairs scanned per minute
As a database search method, CPSARST provides a list of hits ranked by the statistically meaningful E-value. Given that a hit has a similarity score S, the E-value is the number of different alignments with scores equivalent to or better than S that are expected to occur in this particular database search by chance [45–47]. A lower E-value indicates a higher significance for the score. This statistical significance is a useful indicator of the reliability of the search results.
To determine the extent to which two proteins are related by a CP, we used the CP scoring scheme described by Vesterstrom and Taylor . The minimum value of this CP score is -1 for a pair of completely linearly aligned proteins, and its maximum value is 1 for a perfect CP alignment. In general, a small positive CP score indicates that only a small fraction of the protein is permutated while a larger one reveals that the CP site is closer to the middle of the polypeptide chain.
In the survey of nrPDB-90 and nrSCOP-90, we had set the RMSD cutoff as 5 Å, the E-value cutoff as 0.1 and the CP score threshold as 0.2. Under these criteria, 2,911 and 4,228 candidate pairs were identified in nrPDB-90 and nrSCOP-90, respectively. For nrPDB-90, the 2,911 candidate pairs consisted of 1,822 different polypeptides, that is 12.6% (1,822 of 14,422) of the polypeptides have CP relationships with at least one other polypeptide. For nrSCOP-90, the proportion is 17.6% (2,060 of 11,688).
Novel circular permutation family detected by CPSARST
Circular permutants detected by CPSARST
We examined the candidate pairs detected by CPSARST with RMSD ≤ 3.5 Å by visual inspection of superimposed structures and found that approximately 55%, 25% and 20% are mainly alpha, mainly beta, and alpha-beta structures, respectively. These CP pairs are listed, each with a superimposed image, in Additional data file 3; many well-known CP cases are listed, such as some lectins, glucanases, transaldolases, methyltransferases, ferredoxins, protease inhibitors and GTPases. Furthermore, a large number of these CP relationships have not been reported yet, for example, chorismate mutases ([PDB:1CSM] versus [PDB:2AO2]); some (approximately 20%) even involve hypothetical proteins, implying that CPSARST can be applied to suggest possible functions for hypothetical proteins.
Top 20 CP relationships detected from the nrPDB-90 dataset for hypothetical protein YlqF*
Ras-related protein Rab-27B protein YPT7P
GTP-binding protein REM 1
Ras-related protein Rab-7
Ras-related protein Rab-18
GTP-binding protein, GTPase domain
Ras-related protein Rab-30
Ras-related protein Rab-3D
Ras-related protein Rab-22A
Probable GTP-binding protein engB
Ras-related protein Rab-4B
Ras-related protein Ral-A
Rab5C, GTPase domain
Ras-related protein Rab-43
ADP-ribosylation factor-like protein 3
Ras-related protein Rab-9B
Ras-related protein Rab-11A
Ras-related protein R-Ras
Rab14, member Ras oncogene family
Detecting circular permutants with low sequence identities
Generally speaking, although protein similarity search methods based on amino acid sequence alignments are much faster than those based on structural comparisons, they are less sensitive in detecting remote homology . In the case of detecting CP, sequence-based methods have met great challenges because of the evolutionary complexity and diversity of circular permutants. Except the post-translational modification model, all the other proposed mechanisms for CP involve at least two stages of genetic modifications in evolution (see Background), implying that the formation of CP may require a long period during which other common mutations (substitutions, insertions and deletions) can accumulate to such an extent that the circular permutants have much diverged from the parent protein in sequence. Therefore, sequence-based methods may be limited in identifying distantly related CPs. For instance, Uliel et al. used an amino acid sequence-based heuristic algorithm to screen the entire Swiss-Prot database (version 34.0; approximately 80,000 proteins) and the Pfam database  for CP pairs, and identified only 32 cases . However, in the same year, Jung and Lee  used a structure-based algorithm to survey a protein dataset (3,035 domains) collected from SCOP and reported that approximately 47% (1,433 of 3,035) of the domains each had at least one circular permutant. Furthermore, they discovered that less than 0.3% of the abundant symmetric CPs have > 30% sequence identities. Although this large difference is partially caused by the fact that Uliel et al. used more stringent criteria to identify CP, it basically indicates that amino acid sequence-based methods can miss many distantly related CPs .
Among the CP candidate pairs detected by CPSARST in nrSCOP-90, 27.5% can be considered as symmetric CPs (Table 4). Similar to the observation of Jung and Lee, few of these symmetric CPs (2.6%) have sequence identities > 30%. Furthermore, although 91% of the naturally occurring CP pairs listed in Table 2 have sequence identities ≤ 20%, CPSARST shows good performance when compared with other structure-based methods. These data demonstrate that CPSARST is able to detect CPs with low sequence identities.
In most cases, it is not easy to achieve high accuracy and speed simultaneously for a database search method; instead, some compromising balance is usually reached. Judging from the fact that using previous structure-based CP-detecting methods such as SAMO to search the current PDB requires more than 15,000 minutes  per query, it is reasonable that speed should be weighted more than accuracy in the field of CP searching, especially in this post-genomic era when the amount of protein structural data is increasing rapidly. CPSARST has been shown to achieve accuracy substantially higher than sequence-based UFAU (Figure 2) and comparable to structure-based SAMO (Tables 2 and 3); as to the speed, it can scan 52,800 database proteins per minute (Table 4), approximately 4 and 8,824 times faster than UFAU and SAMO, respectively. This improvement in speed is achieved by two features: it transforms three-dimensional information of protein structures into one-dimensional text strings and, thus, converts structural comparison problems into text sequence alignment problems, which can be solved much more rapidly; and, in both the screening and refinement stages, CPSARST does not stick to the absolute qualities of the alignments. By focusing on the relative qualities between two rounds of alignments, it can rapidly sieve out useful information. We call this strategy 'double filter-and-refine'. Here we propose that it is efficient, flexible and applicable to other biological research fields, especially where the data analyses require large-scale computational power.
The prevalence and definition of circular permutation
Previous studies have made conflicting conclusions; some presumed that CP is rare in nature [6, 14, 30] - approximately 5% as indicated by Vogel and Morea  - while others supposed that CP is frequent [1, 29, 34] - approximately 47% as estimated by Jung and Lee . In our observation, studies based on structural analyses usually discovered more CPs than sequence-based ones; besides, studies that consider the whole protein as the unit that undergoes CP would conclude that CP is rare whereas those viewing the domain as the unit that undergoes CP would suggest CP to be frequent.
As we have discussed, it is reasonable that more cases of CP are detected by structural comparison than by amino acid sequence alignment. However, although proteins with similar structures are usually functionally related , when a pair of structurally and functionally similar proteins share extremely low sequence identity, we still cannot exclude the possibility that they are just the products of convergent evolution [56–58] and do not share the same origin. In the case of identifying CP, it is noteworthy that even if a pair of proteins shows a high extent of CP topologically, it does not directly mean that an evolutionary CP event has indeed taken place. Therefore, we argue that detecting CP only by structure would result in too many false positives when judged from the point of view of molecular evolution. This is why we have set up a user-adjustable sequence identity filter in the web service of CPSARST  (see Materials and methods). When this filter was not enabled, the prevalence of CP estimated by CPSARST was 12.6-17.6% (see Results). When we considered that a real CP should have a higher sequence identity in the CP alignment than in the linear alignment, around one-fourth of the candidate pairs counted in Table 4 was filtered out, lowering the estimated prevalence of CP to 9.0-13.0%.
The fact that the frequency of CP estimated by CPSARST is only one-third of that estimated by Jung and Lee  is probably because of the more stringent criteria used by CPSARST. We set the RMSD cutoff as 5 Å, the CP score threshold as 0.2 and the least permutation size as 20% for a pair of proteins to be considered as CP candidates; similar criteria were not seen in the report of Jung and Lee. Also, considering their methodology, there is a large likelihood that proteins containing repeats and duplications are regarded as CPs, many of which have been treated as false cases by Uliel et al.  and us (see Materials and methods). When we loosened the criteria to 6 Å (RMSD cutoff), 0.1 (CP score threshold) and 10% (least permutation size), and did not filter out proteins containing repeats, the CP prevalence estimated by CPSARST was 34.7-36.7% (see Additional data file 5 for statistics), similar to Jung and Lee's estimation. However, since they did not provide any supplementary list of their CP candidates, we are unable to check our speculation.
To our knowledge, all the currently available CP-detecting methods are more sensitive to global CP (the unit undergoing CP is the whole protein) than partial CP (the CP is within a region of the protein), as is CPSARST. To detect partial CP, domain databases such as SCOP and Pfam are usually used as the target databases instead of the PDB and Swiss-Prot. Although considering the domain as the unit undergoing CP, that is, partial CP, can identify more candidates (as shown in Table 4), some scientists have argued that these cases should be considered as 'swaps' rather than CPs . This controversy is another cause of the conflicting conclusions about the prevalence of CP.
To sum up, despite the conflicting conclusions made by previous studies, there seem to be rational explanations for this situation. We suppose that the identification of CPs requires a precise definition of CP depending on the purpose of the study. In our opinion, if evolutionary importance and mechanisms are concerned, global CP with reasonable sequence identity limitation will be suitable, while partial CP without limitation of sequence identity in the definition may help scientists to discover novel functional relationships among proteins and to reveal the principles of protein folding.
Possible applications of CPSARST
The performance of CPSARST suggests that it is an efficient approach to the detection of CPs in large protein structural datasets; routine bank-against-bank searches are thus achievable. The multiple indexes produced by CPSARST, for example, the structural similarity score, statistically meaningful E-value, sequence identity, alignment size, RMSD and CP score, are beneficial to develop automated procedures such as a functional assignment system for novel hypothetical proteins. Also, information retrieved by bank-against-bank searches can be organized into a CP database.
Since the first observation of CP in plant lectins , many natural and artificial cases have been studied and several CP detecting methods have been developed; however, there is still no CP database and no standard procedure for evaluating CP detection methods. We suppose that a well-organized CP database will help move this field forward. It could provide a standard for the evaluations of CP-related programs, such as CP search tools and predictors of viable CP sites , and provide information to reveal the evolutionary mechanisms of CP.
CP has been applied to X-ray crystallography , modification of enzymes , creation of novel fusion proteins [25, 28], and construction of protein switches and sensors [26, 27]. All these applications depend on a proper choice of position to create CP. A CP database offering plenty of materials for the discovery of the rules by which Nature selects CP sites should be advantageous to the technical applications of CP.
Although interesting, there is still much uncertainty about the evolutionary mechanisms and importance of CP [6, 18, 29, 30]. Weiner et al.  have proposed that the frequency of incomplete or intermediate CP may help determine the major mechanism of CP. The 'double filter-and-refine' strategy of CPSARST is very flexible. With extended boundary criteria, CPSARST can specifically detect incomplete or intermediate CP. The ability of CPSARST to perform rapid bank-against-bank searches by structural comparisons gives it the potential to reveal how, why and to what extent Nature achieves protein evolutionary and functional diversity by using CP.
We have developed an efficient circular permutation search method, CPSARST, which linearly encodes protein structures as text strings and achieves a structural similarity searching speed thousands of times as high as related algorithms. When tested with engineered CPs, CPSARST successfully retrieved all the natural proteins with accurate permutation site predictions. Its ability to identify natural CPs is also comparable to other structure-based CP-detecting methods. Its high efficiency makes routine database surveys and bank-against-bank searches achievable. After all-against-all searches of non-redundant PDB and SCOP, we have found that most candidate CP pairs share sequence identity < 20%, explaining why previous sequence-based CP-detecting methods have identified much less CP cases than structure-based algorithms. Based on these search results, we have suggested that the identification of CPs requires a suitable definition of CP depending on the purpose of the study. If global CP with reasonable sequence identity limitation is considered as true CP, the prevalence of CP in protein structural databases is estimated to be 16% by CPSARST, whereas the prevalence of partial CP without limitation of sequence identity in the definition is estimated to be 36%. Several new CP cases have been detected and reported here, inclusive of a novel CP family consisting of microbial GBBPs, molybdate-binding proteins and a cysteine regulon transcriptional activator. In this post-genomics era, when the amount of protein structural data is increasing exponentially, CPASRST can provide a new way to rapidly detect novel relationships among proteins and help to reveal how Nature achieves protein evolutionary and functional diversity by using CP. Its web service and stand-alone Java program are available at .
Materials and methods
All the developments and experiments were performed on an IBM e-server 336 machine with dual 3.2GHz Intel processors, 1 GB RAM and linux operating system.
Linear encoding of protein structures
CPSARST describes three-dimensional protein structures as one-dimensional strings by using a RST algorithm . The torsion angles (φ,δ) of a number of proteins were plotted onto a 10° × 10° dissected RM map. The 1,296 cells on this map were then clustered into 22 groups by nearest-neighbor clustering  based on their spot numbers and angular distances. These groups were assigned a set of English letters called 'Ramachandran codes'. Coordinates of a protein structure could be accordingly transformed into a text string. The scoring matrix for these codes was produced by using a 'regenerative approach' . This linear encoding system converts complicated and time-consuming structural comparison problems into sequence comparisons, which can be done very rapidly. It has been applied to protein structural similarity searching and achieved speeds hundreds of thousands of times higher than CE with an acceptable compromise of accuracy . The structural string generated by RST is different from the amino acid sequence in nature; therefore, we termed it 'Ramachandran sequence' or 'Ramachandran string'.
Generation and analyses of random circular permutants
A hundred polypeptide sequences each longer than 100 residues and sharing < 40% sequence identities were randomly selected from the PDB to perform in silico circular permutations. Regular mutations, i.e. substitutions, insertions and deletions, were introduced in the ratio 150:1:1 to generate random CPs, resulting in 100 levels of decreasing sequence identities/similarities for every polypeptide sequence. The collection of these computer-generated random CPs is called the RCP dataset.
The substitution rates of various amino acids used to generate the RCP dataset were calculated by analyzing a large number of multiple alignment blocks, the sequences of which shared < 45% identity, as described previously . Since every sequence in the RCP dataset was evolved independently to avoid any possible bias, we supposed that it is suitable for the evaluation of CP detection methods. RCP has two subsets, the identity subset and similarity subset, each containing 10,000 CP pairs (100 parent sequences × 100 circular permutants). They are listed in Additional data file 6.
Comparisons between each parent sequence and its CPs in the identity subset of the RCP dataset were performed by the traditional heuristic method blast . Two parameters were monitored to assess the performance: the percentage of cases in which the exact permutation site was retrieved; and the average percentage distance of the found permutation site to the exact one (see Results). Another two parameters were monitored to optimize the filter for RM sequence searches: the ratio of similarity scores and the negative logarithm in base 10 (-log10) of the E-value ratios, before and after the duplication of query sequences (see Additional data file 7 for the results). We found that all the score ratios are equal to or higher than 1, indicating that when the sequence of a CP is duplicated (DL), it always aligns to its parent sequence better than the normal length (NL). As to the E-value ratios, that is, -log10(E-value DL /E-value NL ), approximately 80% of them are larger than 2, which stands for a 102-fold improvement of the significance of the similarity score after duplicating the query sequence (see Results for detailed information about E-values).
Screening of circular permutant candidates
Refinement of the search results
The refinement of search results of RM sequences were performed by FAST, an accurate structural alignment algorithm [40, 63] and a CP scoring scheme developed by Vesterstrom and Taylor , following these steps. Step 1: for each candidate, the putative permutation site is parsed from the alignment result of the duplicated query string. Step 2: performing two rounds of FAST structural alignments. The first round is a normal linear alignment. In the second round, the circularly permuted alignment, the PDB file of the query structure was manipulated by exchanging the amino- and carboxy-terminal halves according to the putative permutation site so that FAST will do the structural alignment 'backside first'. Step 3: if the FAST alignment size after CP is no larger than 50% of the smaller size of the query and subject proteins, it is screened out. Step 4: the RMSD cutoff of the CP alignment is set as 5 Å. Step 5: in order to differentiate true CP from protein with internal repeats or duplications, two criteria have been set: the alignment size of the CP alignment should be larger than that of the linear alignment; and the FAST similarity score  or TOP score  (see formula (2)) calculated from the CP alignment should gain at least 25% improvement over the linear alignment. Step 6: the CP score  was calculated from the aligned positions by FAST. It has a theoretical minimum value of -1 (a completely linear alignment) and a maximum value of 1 (a perfect CP). Although Vesterstrom and Taylor suggested that an alignment with this CP score higher than 0.25 can be considered as a significant CP, we find that 0.2 is still suitable in our multi-filter system. Step 7: the putative CP site is refined by parsing the output of FAST structural alignment.
Pair-wise circularly-permuted structural alignments
The procedure of the database search tool CPSARST can be simplified to perform pair-wise structure alignments as follows. First, transform the query and subject protein structures into RM sequences Q and S, respectively. Second, duplicate Q string to QQ, and align it to S. Third, find the best local alignment and trace it back to the 'start point', which is the putative permutation site. For example, if in the best local alignment, the fragment between residues q1 and q2 of Q is aligned to the fragment between s1 and s2 of S, then the permutation site of Q will be traced back to q1 - s1 + 1. Fourth, introduce a CP into the query structure according to the putative CP site. Compare this new structure with the subject protein by using an accurate structural alignment algorithm mentioned above.
CPSARST basically works on the structurally meaningful RM strings transformed by RST; however, since there have been many errors and inconsistencies reported in PDB entries , a few polypeptides (approximately 2%) cannot be successfully transformed into RM strings. Therefore, in the implementation of CPSARST, we have added two extra rounds of amino acid sequence alignment searches, one by the normal length and the other by the duplicated sequence, prior to the RM string searches. Besides, the sequence homology filter can be enabled to guarantee a higher evolutionary significance of the search results (see Discussion), and several parameters are adjustable by the users according their needs or the property of materials.
Word size and gap penalties
These are traditional parameters used by sequence alignment search tools such as BLAST  and FASTA . For CPSARST, a smaller word size can provide a more accurate determination of the CP site while taking more running time. In our experience, lower gap penalties can give CPSARST higher sensitivity, although there is a trade-off for running time, too. Generally speaking, these parameters have only minor effects on the performance.
Permutation size limit and circular permutation score threshold
It has been supposed that a tiny permutation size is unlikely a real CP , but there is yet no common conclusion made for the generally suitable permutation size limit. Setting a large limit ensures that CPSARST identifies unambiguous CP relationships; however, novel cases can thus be missed. If the query protein is large enough, for example, > 150 residues, a small size limit such as 10% may still work well, but we would like to suggest a 15% limit for general situations. The CP score threshold has similar effects on the search quality of CPSARST to the permutation size limit (see Results and Materials and methods for further information).
RMSD cutoff and structural similarity improvement filter
Closer-related protein structures will have a lower RMSD when superimposed. This is also true for CPs. This cutoff can be used as a basic quality control in the same way as other conventional structural comparison tools. The normalized structural similarity score of FAST  is another basic quality control. Candidate pairs without enough improvement in structural similarity after CP can be screened out.
Examples of practical settings for these parameters can be found in Additional data file 8. CPSARST is available at .
Additional data files
The following additional data are available with the online version of this paper. Additional data file 1 lists the nrPDB-90 dataset, the 90% sequence identity subset of the PDB (January 2007). Additional data file 2 lists the nrSCOP-90 dataset, the 90% sequence identity subset of SCOP (v.1.71). Additional data file 3 is a table listing candidate CP pairs in the nrPDB-90 dataset detected by CPSARST with RMSD ≤ 3.5 Å. Additional data file 4 is a list of the structural neighbors of the hypothetical protein YlqF in PDB retrieved by DALI . Additional data file 5 is a table showing statistical results of protein structural database searches with broad criteria by CPSARST. Additional data file 6 lists the RCP dataset, a collection of 20,000 in silico random CPs. Additional data file 7 is a plot summarizing the score and E-value ratios calculated from the RCP dataset. Additional data file 8 is a list of the parameter settings used throughout this article.
Circular Permutation Search Aided by Ramachandran Sequential Transformation
glycine betaine-binding protein
Protein Data Bank
random circular permutation
root mean square distance
Ramachandran sequential transformation.
This work was supported by the National Science Council, Taiwan, ROC (NSC grant numbers: 95-3112-B-007-006 and 96-3112-B-007-006). We thank the authors of the BLAST and FAST algorithms, which were extensively used in this study.
- Jeltsch A: Circular permutations in the molecular evolution of DNA methyltransferases. J Mol Evol. 1999, 49: 161-164. 10.1007/PL00006529.PubMedView ArticleGoogle Scholar
- Weiner J, Thomas G, Bornberg-Bauer E: Rapid motif-based prediction of circular permutations in multi-domain proteins. Bioinformatics. 2005, 21: 932-937. 10.1093/bioinformatics/bti085.PubMedView ArticleGoogle Scholar
- Tsai LC, Shyur LF, Lee SH, Lin SS, Yuan HS: Crystal structure of a natural circularly permuted jellyroll protein: 1,3-1,4-beta-D-glucanase from Fibrobacter succinogenes . J Mol Biol. 2003, 330: 607-620. 10.1016/S0022-2836(03)00630-2.PubMedView ArticleGoogle Scholar
- Ribeiro EA, Ramos CH: Circular permutation and deletion studies of myoglobin indicate that the correct position of its N-terminus is required for native stability and solubility but not for native-like heme binding and folding. Biochemistry. 2005, 44: 4699-4709. 10.1021/bi047908c.PubMedView ArticleGoogle Scholar
- Cunningham BA, Hemperly JJ, Hopp TP, Edelman GM: Favin versus concanavalin A: circularly permuted amino acid sequences. Proc Natl Acad Sci USA. 1979, 76: 3218-3222. 10.1073/pnas.76.7.3218.PubMedPubMed CentralView ArticleGoogle Scholar
- Lindqvist Y, Schneider G: Circular permutations of natural protein sequences: structural evidence. Curr Opin Struct Biol. 1997, 7: 422-427. 10.1016/S0959-440X(97)80061-9.PubMedView ArticleGoogle Scholar
- Murzin AG: Probable circular permutation in the flavin-binding domain. Nat Struct Biol. 1998, 5: 101-10.1038/nsb0298-101.PubMedView ArticleGoogle Scholar
- Castillo RM, Mizuguchi K, Dhanaraj V, Albert A, Blundell TL, Murzin AG: A six-stranded double-psi beta barrel is shared by several protein superfamilies. Structure. 1999, 7: 227-236. 10.1016/S0969-2126(99)80028-8.PubMedView ArticleGoogle Scholar
- Polekhina G, Board PG, Gali RR, Rossjohn J, Parker MW: Molecular basis of glutathione synthetase deficiency and a rare gene permutation event. EMBO J. 1999, 18: 3204-3213. 10.1093/emboj/18.12.3204.PubMedPubMed CentralView ArticleGoogle Scholar
- Bujnicki JM: Sequence permutations in the molecular evolution of DNA methyltransferases. BMC Evol Biol. 2002, 2: 3-10.1186/1471-2148-2-3.PubMedPubMed CentralView ArticleGoogle Scholar
- Jung J, Lee B: Protein structure alignment using environmental profiles. Protein Eng. 2000, 13: 535-543. 10.1093/protein/13.8.535.PubMedView ArticleGoogle Scholar
- Antcheva N, Pintar A, Patthy A, Simoncsits A, Barta E, Tchorbanov B, Pongor S: Proteins of circularly permuted sequence present within the same organism: the major serine proteinase inhibitor from Capsicum annuum seeds. Protein Sci. 2001, 10: 2280-2290. 10.1110/ps.21701.PubMedPubMed CentralView ArticleGoogle Scholar
- Goldenberg DP, Creighton TE: Circular and circularly permuted forms of bovine pancreatic trypsin inhibitor. J Mol Biol. 1983, 165: 407-413. 10.1016/S0022-2836(83)80265-4.PubMedView ArticleGoogle Scholar
- Vogel C, Morea V: Duplication, divergence and formation of novel protein topologies. Bioessays. 2006, 28: 973-978. 10.1002/bies.20474.PubMedView ArticleGoogle Scholar
- Qian Z, Lutz S: Improving the catalytic activity of Candida antarctica lipase B by circular permutation. J Am Chem Soc. 2005, 127: 13466-13467. 10.1021/ja053932h.PubMedView ArticleGoogle Scholar
- Anantharaman V, Koonin EV, Aravind L: Regulatory potential, phyletic distribution and evolution of ancient, intracellular small-molecule-binding domains. J Mol Biol. 2001, 307: 1271-1292. 10.1006/jmbi.2001.4508.PubMedView ArticleGoogle Scholar
- Todd AE, Orengo CA, Thornton JM: Plasticity of enzyme active sites. Trends Biochem Sci. 2002, 27: 419-426. 10.1016/S0968-0004(02)02158-8.PubMedView ArticleGoogle Scholar
- Bulaj G, Koehn RE, Goldenberg DP: Alteration of the disulfide-coupled folding pathway of BPTI by circular permutation. Protein Sci. 2004, 13: 1182-1196. 10.1110/ps.03563704.PubMedPubMed CentralView ArticleGoogle Scholar
- Heinemann U, Hahn M: Circular permutations of protein sequence: not so rare?. Trends Biochem Sci. 1995, 20: 349-350. 10.1016/S0968-0004(00)89073-8.PubMedView ArticleGoogle Scholar
- Li L, Shakhnovich EI: Different circular permutations produced different folding nuclei in proteins: a computational study. J Mol Biol. 2001, 306: 121-132. 10.1006/jmbi.2000.4375.PubMedView ArticleGoogle Scholar
- Chen J, Wang J, Wang W: Transition states for folding of circular-permuted proteins. Proteins. 2004, 57: 153-171. 10.1002/prot.20175.PubMedView ArticleGoogle Scholar
- Schwartz TU, Walczak R, Blobel G: Circular permutation as a tool to reduce surface entropy triggers crystallization of the signal recognition particle receptor beta subunit. Protein Sci. 2004, 13: 2814-2818. 10.1110/ps.04917504.PubMedPubMed CentralView ArticleGoogle Scholar
- Anand B, Verma SK, Prakash B: Structural stabilization of GTP-binding domains in circularly permuted GTPases: implications for RNA binding. Nucleic Acids Res. 2006, 34: 2196-2205. 10.1093/nar/gkl178.PubMedPubMed CentralView ArticleGoogle Scholar
- Gebhard LG, Risso VA, Santos J, Ferreyra RG, Noguera ME, Ermacora MR: Mapping the distribution of conformational information throughout a protein sequence. J Mol Biol. 2006, 358: 280-288. 10.1016/j.jmb.2006.01.095.PubMedView ArticleGoogle Scholar
- Kojima M, Ayabe K, Ueda H: Importance of terminal residues on circularly permutated Escherichia coli alkaline phosphatase with high specific activity. J Biosci Bioeng. 2005, 100: 197-202. 10.1263/jbb.100.197.PubMedView ArticleGoogle Scholar
- Ostermeier M: Engineering allosteric protein switches by domain insertion. Protein Eng Des Sel. 2005, 18: 359-364. 10.1093/protein/gzi048.PubMedView ArticleGoogle Scholar
- Galarneau A, Primeau M, Trudeau LE, Michnick SW: Beta-lactamase protein fragment complementation assays as in vivo and in vitro sensors of protein protein interactions. Nat Biotechnol. 2002, 20: 619-622. 10.1038/nbt0602-619.PubMedView ArticleGoogle Scholar
- Baird GS, Zacharias DA, Tsien RY: Circular permutation and receptor insertion within green fluorescent proteins. Proc Natl Acad Sci USA. 1999, 96: 11241-11246. 10.1073/pnas.96.20.11241.PubMedPubMed CentralView ArticleGoogle Scholar
- Jung J, Lee B: Circularly permuted proteins in the protein structure database. Protein Sci. 2001, 10: 1881-1886.PubMedPubMed CentralView ArticleGoogle Scholar
- Uliel S, Fliess A, Unger R: Naturally occurring circular permutations in proteins. Protein Eng. 2001, 14: 533-542. 10.1093/protein/14.8.533.PubMedView ArticleGoogle Scholar
- Carrington DM, Auffret A, Hanke DE: Polypeptide ligation occurs during post-translational modification of concanavalin A. Nature. 1985, 313: 64-67. 10.1038/313064a0.PubMedView ArticleGoogle Scholar
- Ponting CP, Russell RB: Swaposins: circular permutations within genes encoding saposin homologues. Trends Biochem Sci. 1995, 20: 179-180. 10.1016/S0968-0004(00)89003-9.PubMedView ArticleGoogle Scholar
- Peisajovich SG, Rockah L, Tawfik DS: Evolution of new protein topologies through multistep gene rearrangements. Nat Genet. 2006, 38: 168-174. 10.1038/ng1717.PubMedView ArticleGoogle Scholar
- Chen L, Wu LY, Wang Y, Zhang S, Zhang XS: Revealing divergent evolution, identifying circular permutations and detecting active-sites by protein structure comparison. BMC Struct Biol. 2006, 6: 18-10.1186/1472-6807-6-18.PubMedPubMed CentralView ArticleGoogle Scholar
- Russell RB, Ponting CP: Protein fold irregularities that hinder sequence analysis. Curr Opin Struct Biol. 1998, 8: 364-371. 10.1016/S0959-440X(98)80071-7.PubMedView ArticleGoogle Scholar
- Holm L, Sander C: Protein structure comparison by alignment of distance matrices. J Mol Biol. 1993, 233: 123-138. 10.1006/jmbi.1993.1489.PubMedView ArticleGoogle Scholar
- Shindyalov IN, Bourne PE: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 1998, 11: 739-747. 10.1093/protein/11.9.739.PubMedView ArticleGoogle Scholar
- Uliel S, Fliess A, Amir A, Unger R: A simple algorithm for detecting circular permutations in proteins. Bioinformatics. 1999, 15: 930-936. 10.1093/bioinformatics/15.11.930.PubMedView ArticleGoogle Scholar
- Vesterstrom J, Taylor WR: Flexible secondary structure based protein structure comparison applied to the detection of circular permutation. J Comput Biol. 2006, 13: 43-63. 10.1089/cmb.2006.13.43.PubMedView ArticleGoogle Scholar
- Lo WC, Huang PJ, Chang CH, Lyu PC: Protein structural similarity search by Ramachandran codes. BMC Bioinformatics. 2007, 8: 307-10.1186/1471-2105-8-307.PubMedPubMed CentralView ArticleGoogle Scholar
- CPSARST - Circular Permutation Search Aided by Ramachandran Sequential Transformation. [http://sarst.life.nthu.edu.tw/cpsarst]
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res. 2000, 28: 235-242. 10.1093/nar/28.1.235.PubMedPubMed CentralView ArticleGoogle Scholar
- Lu G: Top: A new method for protein structure comparisons and similarity searches. J Appl Crystallogr. 2000, 33: 176-183. 10.1107/S0021889899012339.View ArticleGoogle Scholar
- Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE: The ASTRAL Compendium in 2004. Nucleic Acids Res. 2004, 32: D189-192. 10.1093/nar/gkh034.PubMedPubMed CentralView ArticleGoogle Scholar
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.PubMedPubMed CentralView ArticleGoogle Scholar
- Altschul SF, Gish W: Local alignment statistics. Methods Enzymol. 1996, 266: 460-480.PubMedView ArticleGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410.PubMedView ArticleGoogle Scholar
- Pearson WR: Flexible sequence similarity searching with the FASTA3 program package. Methods Mol Biol. 2000, 132: 185-219.PubMedGoogle Scholar
- Ostermeier C, Brunger AT: Structural basis of Rab effector specificity: crystal structure of the small G protein Rab3A complexed with the effector domain of rabphilin-3A. Cell. 1999, 96: 363-374. 10.1016/S0092-8674(00)80549-8.PubMedView ArticleGoogle Scholar
- The Dali Server. [http://www.ebi.ac.uk/dali/]
- Toms AV, Haas AL, Park JH, Begley TP, Ealick SE: Structural characterization of the regulatory proteins TenA and TenI from Bacillus subtilis and identification of TenA as a thiaminase II. Biochemistry. 2005, 44: 2319-2329. 10.1021/bi0478648.PubMedView ArticleGoogle Scholar
- Chen CC, Han Y, Niu W, Kulakova AN, Howard A, Quinn JP, Dunaway-Mariano D, Herzberg O: Structure and kinetics of phosphonopyruvate hydrolase from Variovorax sp. Pal2: new insight into the divergence of catalysis within the PEP mutase/isocitrate lyase superfamily. Biochemistry. 2006, 45: 11491-11504. 10.1021/bi061208l.PubMedView ArticleGoogle Scholar
- Sauder JM, Arthur JW, Dunbrack RL: Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins. 2000, 40: 6-22. 10.1002/(SICI)1097-0134(20000701)40:1<6::AID-PROT30>3.0.CO;2-7.PubMedView ArticleGoogle Scholar
- Bateman A, Birney E, Durbin R, Eddy SR, Howe KL, Sonnhammer EL: The Pfam protein families database. Nucleic Acids Res. 2000, 28: 263-266. 10.1093/nar/28.1.263.PubMedPubMed CentralView ArticleGoogle Scholar
- Zuckerkandl E, Pauling L: Evolutionary divergence and convergence in proteins. Evolving Genes and Proteins. 1965, New York: Academic Press, 97-166.Google Scholar
- Krishna SS, Grishin NV: Structurally analogous proteins do exist!. Structure. 2004, 12: 1125-1127. 10.1016/j.str.2004.06.004.PubMedView ArticleGoogle Scholar
- Theobald DL, Wuttke DS: Divergent evolution within protein superfolds inferred from profile-based phylogenetics. J Mol Biol. 2005, 354: 722-737. 10.1016/j.jmb.2005.08.071.PubMedPubMed CentralView ArticleGoogle Scholar
- Ausiello G, Peluso D, Via A, Helmer-Citterich M: Local comparison of protein structures highlights cases of convergent evolution in analogous functional sites. BMC Bioinformatics. 2007, 8 (Suppl 1): S24-10.1186/1471-2105-8-S1-S24.PubMedPubMed CentralView ArticleGoogle Scholar
- Paszkiewicz KH, Sternberg MJ, Lappe M: Prediction of viable circular permutants using a graph theoretic approach. Bioinformatics. 2006, 22: 1353-1358. 10.1093/bioinformatics/btl095.PubMedView ArticleGoogle Scholar
- Weiner J, Bornberg-Bauer E: Evolution of circular permutations in multidomain proteins. Mol Biol Evol. 2006, 23: 734-743. 10.1093/molbev/msj091.PubMedView ArticleGoogle Scholar
- Jain AK, Dubes RC: Algorithms for Clustering Data. 1988, New Jersey: Prentice HallGoogle Scholar
- Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA. 1992, 89: 10915-10919. 10.1073/pnas.89.22.10915.PubMedPubMed CentralView ArticleGoogle Scholar
- Zhu J, Weng Z: FAST: a novel protein structure alignment algorithm. Proteins. 2005, 58: 618-627. 10.1002/prot.20331.PubMedView ArticleGoogle Scholar
- Westbrook J, Feng Z, Jain S, Bhat TN, Thanki N, Ravichandran V, Gilliland GL, Bluhm WF, Weissig H, Greer DS, Bourne PE, Berman HM: The Protein Data Bank: unifying the archive. Nucleic Acids Res. 2002, 30: 245-248. 10.1093/nar/30.1.245.PubMedPubMed CentralView ArticleGoogle Scholar
- Horn C, Sohn-Bosser L, Breed J, Welte W, Schmitt L, Bremer E: Molecular determinants for substrate specificity of the ligand-binding protein OpuAC from Bacillus subtilis for the compatible solutes glycine betaine and proline betaine. J Mol Biol. 2006, 357: 592-606. 10.1016/j.jmb.2005.12.085.PubMedView ArticleGoogle Scholar
- Schiefner A, Breed J, Bosser L, Kneip S, Gade J, Holtmann G, Diederichs K, Welte W, Bremer E: Cation-pi interactions as determinants for binding of the compatible solutes glycine betaine and proline betaine by the periplasmic ligand-binding protein ProX from Escherichia coli . J Biol Chem. 2004, 279: 5588-5596. 10.1074/jbc.M309771200.PubMedView ArticleGoogle Scholar
- Schiefner A, Holtmann G, Diederichs K, Welte W, Bremer E: Structural basis for the binding of compatible solutes by ProX from the hyperthermophilic archaeon Archaeoglobus fulgidus . J Biol Chem. 2004, 279: 48270-48281. 10.1074/jbc.M403540200.PubMedView ArticleGoogle Scholar
- DeLano WL: The PyMOL Molecular Graphics System. 2002, San Carlos, CA: DeLano ScientificGoogle Scholar
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22: 4673-4680. 10.1093/nar/22.22.4673.PubMedPubMed CentralView ArticleGoogle Scholar
- MDL Chime. [http://www.mdl.com/products/framework/chime/]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.