The SeqFEATURE library of 3D functional site models: comparison to existing methods and applications to protein function annotation

SeqFEATURE, a tool for protein function annotation, models protein functions described by sequence motifs using a structural representation. The tool shows significantly improved performance over other methods when sequence and structural similarity are low.


Background
With the complete genomes sequenced for an increasing number of organisms, emphasis is shifting from identifying genes and gene products to understanding protein function and the interactions between biological entities on a systems level. Molecular-level descriptions of cellular physiology are critical for elucidating biological processes and manipulating them for medical or industrial purposes, such as bioremediation or drug design. In particular, the three-dimensional structures of proteinsprovide clues about their functions and how function may be manipulated by mutation or with small molecule chemicals. With protein structure determination becoming more efficient, the number of available structures is growing rapidly. The emergence of structural genomics [1], which aims to solve a representative set of proteins covering the entire space of naturally occurring structural folds, has spurred this growth, and depositions in the Protein Data Bank (PDB) [2] from structural genomics projects accounted for 16% of new structures in 2006, almost double the percentage in 2003 [3]. Structural genomics data and targets from almost all structural genomics centers are stored centrally in TargetDB [4], a database accessible from the PDB.
Because a major goal of structural genomics is to sample the entire protein structure space, many of the structural genomics centers target proteins with novel folds and low sequence identity to known proteins. The number of structures released per year by structural genomics has grown to almost 1,400 in 2007, with about 50% of these having less than 30% sequence identity to the rest of the PDB [3]. The consequence of this growth is that many of the new structures being deposited into the PDB lack functional annotation. Learning the functions of these new proteins will enable us to take best advantage of structural genomics efforts, but using conventional experimental methods can be tremendously time-consuming and expensive without testable hypotheses. As the field of structural genomics matures, the bottleneck from structure determination to functional annotation will become more pronounced. Automated function prediction programs would greatly alleviate this problem by providing clues and focusing investigation.
Many function prediction programs exist, ranging from those that describe general physicochemical properties of proteins to those that characterize functional domains or predict specific enzymatic activity. The majority of predictors use primary sequence, and the simplest method is to use a sequence alignment algorithm such as BLAST [5], since high sequence similarity is almost always indicative of evolutionary -and, often times, functional -conservation. Wilson et al. [6] showed that precise function can be transferred reliably above 40% and broad functional class above 25% sequence identity. In addition, many tools take advantage of curated databases, such as the manually inspected profile-Hidden Markov Models (HMMs) contained in the Pfam database of protein families [7], and PROSITE, which consists of manually built sequence patterns and profiles [8].
Both Pfam and PROSITE are contained within InterPro [9,10], a comprehensive, integrated resource for protein sequence information that provides many databases and tools for protein function and domain recognition. Among the tools offered are other HMM-based methods [11] such as HMMTigr, built on the TIGRFAMs database [12], and HMMPanther, built on the PANTHER database [13], both of which focus on function-based classification. Superfamily [14], another HMM-based tool hosted on InterPro, classifies sequences using manually curated models built from the Structural Classification of Proteins (SCOP) [15]. As a complement to Superfamily, Gene3D [16] is a semi-manually curated set of models built using the CATH protein structure classification [17].
Sequence-based tools often provide useful information about function, but they may be less well suited to cases where sequence identity is low. Under these circumstances, structure-based tools may detect functional signals that sequencebased methods are unable to capture due to sequence divergence. Since a protein's structure and function are inexorably linked, structure-based tools can abstract out those elements that are necessary for defining a particular function independent of the linear sequence, lending a degree of sensitivity and specificity that may improve over sequence-based tools. The abstractions can range in scale from entire secondary structure elements to residue or atom-based features. Function annotation based on structure is usually limited to recognition of either general folds or low-level molecular functions such as binding sites and active sites; it is unlikely routinely to predict the overall biological pathways and processes in which a protein participates. However, a complete understanding of structural environments and binding and active site properties provides a pyramid of evidence for the functional roles of a protein.
A number of structure-based function prediction methods have been developed to take advantage of the surge of new protein structures. Some of these rely on expert knowledge for defining the features useful for classifying a particular functional site, while others learn the important features through supervised machine learning approaches. An example of the former is Fuzzy Functional Forms [18], which are three-dimensional descriptions of functional sites based on conserved geometry, protein conformation, and residue identity. The descriptions are built by hand using information from solved crystal structures and published literature, and were able to help identify functional sites in structures whose sequence similarity to known proteins was low enough to render sequence-based tools ineffective [19].
Constructing models manually is time-consuming, however, and several more tractable methods have since been developed. ProKnow [20] uses features extracted from sequence or structure via established tools such PSI-BLAST [21], DALI [22], PROSITE, and the Database of Interacting Proteins (DIP) [23] to map proteins to functional terms in the Gene Ontology (GO) [24]. An alternative method by Polacco and Babbitt [25], called Genetic Algorithm Search for Patterns in Structures, or GASPS, constructs short three-dimensional motifs of functional sites consisting of conserved residues through an iterative mutation and selection process. Secondary Structure Matching (SSM) uses a graph-based representation of secondary structure to find similar structural matches to a query structure from the PDB [26]. Laskwoski et al. [27] presented the idea of 3D templates, which are spatial arrangements of three residues representative of functional sites or ligand-binding sites. These can be built from known examples and matched to the query, or the query structure itself can be broken into 'reverse' templates and matched against the PDB.
Perhaps the most ambitious solution to the problem of automated function prediction is ProFunc [28]. Combining about a dozen different sequence and structure-based methods, including database pattern searches, SSM, and 3D templates, ProFunc offers an impressively complete arsenal of methods for function prediction in one convenient, web-based tool. A recent study tested ProFunc's usefulness in predicting function for structural genomics targets and found that SSM and 3D templates were most effective [29]. To determine correctness of their predictions, they compared GO terms between the query and potential hit.
One difficulty with evaluating the performance of function prediction methods is the complex way in which protein function is defined. Function commonly describes specific enzymatic activities such as isomerization or phosphorylation, but it also encompasses binding to macromolecules or cofactors, modification sites for the attachment of lipids or other molecules, and general association in a biological pathway or complex. Although there are many classification schemes that cover one or more types of function, there is no functional classification describing all types of function that allows comparisons between different levels of the classification. The Enzyme Commission (EC) system [30] is widely accepted for enzyme classification, but it does not describe non-enzymatic functions or take into account sequence conservation or mechanism, which can indicate an evolutionary relationship [31]. The GO database is comprehensive and its terms are extremely popular for biological annotation, but it is difficult to compare terms when function predictions are made at different levels of the GO hierarchy.
The outputs of function prediction methods are also difficult to compare; for example, SSM returns an entire structure or portion of a structure that matched the query, while 3D templates returns either a precise prediction of function at a three-dimensional location in the query, or a protein containing similar residue geometry to the query, depending on the type of template chosen. The ambiguity behind the concept of 'function' and even its location in a structure with the concomitant diversity of frameworks and outputs make it very challenging to compare the performance of different methods. By restricting a comparison to a subset of functions that is relatively well defined, however, such as enzymatic functions, one can gain an impression of how each method performs.
Here, we present and apply SeqFEATURE, an automated method for protein function annotation from structure that is an extension of FEATURE, a more general framework published previously [32]. FEATURE models the local threedimensional microenvironment surrounding functional sites and is, therefore, mostly independent of sequence or structure homology. Although FEATURE performs well [33], the need for manually curated sets of positive training examples may limit its utility. SeqFEATURE addresses this limitation by automatically extracting training sets from the PDB using sequence motifs as seeds [34]. The FEATURE framework models three-dimensional motifs using physical and chemical properties, and thus attempts to generalize the onedimensional motif by recognizing similar three-dimensional environments that do not share significant one-dimensional similarity. We have used SeqFEATURE to build a library of functional site models from PROSITE motifs and evaluated its performance through cross-validation. Importantly, we also compared SeqFEATURE to PROSITE, Pfam, HMMPanther, Gene3D, SSM and 3D templates, and further examined each method's performance in low sequence identity and low structural similarity situations.
As a first step in aiding structural genomics and function prediction efforts, we have applied SeqFEATURE in a comprehensive scan of the entire PDB and focused our analysis on structures from the TargetDB repository of structural genomics targets. We report several interesting case studies from this analysis and compare SeqFEATURE's predictions to those of other methods. All data from the scan and all of the functional site models created to date are available for download [35]. Additionally, one can scan any structure in PDB format with the full library of SeqFEATURE models.

Results
We built a library of 3D functional site models using the FEA-TURE algorithm applied to training sets extracted automatically through sequence motifs found in the PDB. The library was evaluated using cross-validation and compared to existing sequence and structure-based function prediction methods. We also investigated potential functions for structures with unknown function.

SeqFEATURE model library
The SeqFEATURE library consists of 136 models derived from 53 PROSITE patterns (Table 1). Of these models, 105 (77%) have an AUC greater than 0.8, and 64 (47%) have an area under the curve (AUC) greater than 0.95 (Figure 1a). Sensitivity at the default 99% specificity cutoff is slightly more variable, but 82% of the models have sensitivity greater than 0.5 and 59% have sensitivity greater than 0.75 ( Figure  1b).
Receiver operating characteristic (ROC) curves from crossvalidation and Z-score distributions of the final models can be used together to evaluate the ability of the model to distinguish true sites from the background. We evaluate the separation between the positive and negative sites by plotting the distributions of Z-scores for the positive and negative training examples. Plots of positive predictive value (PPV) versus sensitivity give the proportion of total hits to the models that are true positives as a function of sensitivity. Representative examples of ROC curves, PPV versus sensitivity curves, and Z-score distributions for a range of model performances are shown in Figure 2. Table 2 lists the top 25 performing SeqFEATURE models ranked by AUC. The sensitivity of these models is, in general, very high, especially at the default 99% specificity Z-score cutoffs. Even at 100% specificity over half of the models have greater than 0.75 sensitivity. This list also contains a wide range of PROSITE patterns, indicating that the method performs very well for many different types of functional sites.

Methods comparison
Cross-validation is not necessarily representative of how a model will perform on independent test data. In order to get a more realistic estimate of the library's performance, we constructed a specialized test set from the PROSITE records for each pattern, which contain manually curated annotations of true positives, false positives, and false negatives. The test sets consisted, therefore, of structures that the associated PROSITE pattern is known to detect correctly, falsely predict, and altogether miss.
Importantly, we could directly compare if and where SeqFEA-TURE outperforms the originating PROSITE pattern. Figure  3a-c show the numbers of true positives, false negatives, and false positives predicted by SeqFEATURE at varying specificity-based score cutoffs compared to the corresponding Some PROSITE patterns have multiple functional residues, and so more than one model was built for these patterns (for example, multiple EF_HAND models are built). Many hits will score high on all models, but distantly related sites may hit only a subset. SeqFEATURE models are named by concatenating the PROSITE-PATTERN, POSITION, RESIDUE and ATOM for unambiguous identification. PROSITE pattern. Figure 4 shows overall numbers of predictions in each category. Since the test sets were derived from PROSITE, the PROSITE values represent the maximum that could possibly be obtained for each type of prediction. While SeqFEATURE does not predict all true positives correctly, it predicts 82% of true positives correctly at the default 99% specificity cutoff. At the same cutoff, SeqFEATURE also predicts about 78% fewer false negatives than PROSITE, and about 60% fewer false positives.
When we compared performance between SeqFEATURE, Pfam, HMMPanther, and Gene3D (restricting the comparison to the high confidence assigned patterns for each sequence-based method as described in Materials and methods), we found Gene3D to be the best performing method by far, with sensitivity just over 98%, specificity at 85.4%, and PPV at 99% ( On the randomized sample test set (see Materials and methods), we were able to compare SeqFEATURE to 3D templates and SSM (Table 4). Here, SeqFEATURE's best sensitivity increased to 93%, though its best specificity dropped to 93%. PPV decreased slightly to 94% at the most stringent cutoff. 3D templates performed most well out of the structure-based methods, with 90% sensitivity, 100% specificity, and a PPV of 100%. SSM performed similarly to SeqFEATURE.
Importantly, however, since the goal of many function prediction methods, including SeqFEATURE, is to aid in annotation of solved structural genomics targets, we also compared SeqFEATURE to the sequence-based methods using low sequence identity test sets to mimic the situation in which a newly solved structure has low sequence identity to proteins of known function. Table 3 shows the sensitivities of PROSITE, Gene3D, Pfam, HMMPanther and SeqFEATURE at 95% and 99% specificity cutoffs for test sets filtered at 25%, 30%, and 35% sequence identity to the training set. The sequence-based methods perform less well, particularly on sequences filtered at 30% and 25% identity. In contrast, SeqFEATURE achieves a sensitivity of 92.3% at the most lenient cutoff and 84.6% at the moderate cutoff for sequences with <25% sequence identity to the training sets. As shown in Figure 5, the sensitivity of sequence-based methods decreases directly in proportion to sequence identity, whereas the sensitivity of SeqFEATURE at all three cutoffs shows no downward trend. The observation that SeqFEATURE's performance remains robust reflects the fact that SeqFEA-TURE's true positive predictions are concentrated at lower Distribution of AUC and sensitivity for all SeqFEATURE models listed in Table 2  Figure 1 Distribution of AUC and sensitivity for all SeqFEATURE models listed in Table 2. To determine whether the degree of structural similarity affects how well different methods predict function, we also constructed a low structural similarity test set using Dali pairwise matching between members of the low sequence identity test set and the corresponding positive training sets.
Although relatively small (15 examples), the low structural similarity test set allows us to approximate the situation of function prediction on novel folds. As illustrated in Table 4, SeqFEATURE performs better at the 95% and 99% specificity cutoffs than the other structure-based methods; its low structural similarity (LS)-sensitivity is 53% and 47%, respectively, while the LS-sensitivity values for SSM and 3D templates are both less than 30%.
Example ROC curves, precision-recall curves, and Z-score distributions for SeqFEATURE models Figure 2 Example ROC curves, precision-recall curves, and Z-score distributions for SeqFEATURE models. A sample of performance plots for ADH_ZINC.2.HIS.ND1 (top), ASP_PROTEASE.4.ASP.OD1 (middle), and ZINC_FINGER_C2H2_1.9.HIS.ND1 (bottom) are shown, representing a model with excellent performance, good performance, and somewhat satisfactory performance, respectively. The leftmost plot in each row gives the ROC curve in red and random performance in blue, the middle plot shows the precision versus recall (sensitivity) curve, and the rightmost plot shows the distribution of scores for positive sites (red) and negative sites (blue) from training. Because there are many more negative sites than positive sites, the score distributions on the right are normalized to Z-scores.

Predictions of function for structural genomics targets
As of November 2007, TargetDB contained about 5,250 targets with structures released in the PDB; of these, about 1,500 were labeled only with 'structural genomics', 'unknown function', or 'hypothetical protein' in the PDB file header. Using the criteria of model AUC > 0.8, maximum score of that model's negative training set, and minimum score of that model's positive training set, we found 35 potential functional sites. We added one more predicted functional site that did not quite satisfy the criteria but had several such hits for multiple models for the same function, resulting in a total of 36 high-confidence predictions. We compare our predictions to those of PROSITE, Pfam, Gene3D, HMMPanther, SSM and 3D templates for the same structures.
In examining these structures, we found that some of them, though labeled as 'unknown function', actually had some functional annotation and, thus, we could determine the plausibility of our prediction. For example, PDB structure 1XRI is described as a putative phosphatase, and had a high scoring hit for the TYR_PHOSPHATASE_1.3.CYS.SG model. All of the other methods also detected phosphatase activity. Another example is 2E72, described as a zinc-finger containing protein, which hit our ZINC_FINGER_C2H2_1.1.CYS.SG model and for which Pfam, Gene3D, HMMPanther, SSM, and 3D templates all predicted zinc finger motifs.
More interesting, however, are predictions for structures that fail to garner any predictions from PROSITE, Pfam, Gene3D, or HMMPanther. Table 5 presents three of our most intriguing cases. In all of these cases, only SeqFEATURE gives a high-confidence prediction, though 3D templates and SSM sometimes offer matches to putative functions or have lowconfidence predictions. In contrast, the SeqFEATURE predictions have relatively high Z-scores compared to the training set distributions.

Protein Data Bank scan
We additionally scanned every structure in the PDB -about 100 million potential sites -with every SeqFEATURE model. When we consider only those scores that came from models with an AUC of at least 0.95, and were greater than the 99% specificity cutoff defined for that model, 440,460 scores fit these criteria, or about 0.5% of the total number of scores generated. Filtering out redundant scores from proteins with multiple chains results in 298,870 predictions in 29,668 structures. The raw data from the scan are available for download [36]; further analysis of these predictions is beyond the scope of this paper. To access the full library scan for one structure, the user may query by PDB ID; alternatively, one can access all results by querying for a specific SeqFEATURE model.

Discussion
SeqFEATURE extends earlier work on characterizing functional sites in protein structures by automating training set selection. We have used it to build a library of three-dimensional functional site models, 77% of which have an AUC greater than 0.8. When tested on untrained but known true positives, false positives, and false negatives from their corre-  Table 3 Comparison of SeqFEATURE (at three specificity-based score cutoffs) to Gene3D, Pfam, and HMMPanther (Panther)

Gene3D
Pfam We calculated the sensitivity (on PROSITE true positives (TP), PROSITE false negatives (FN), and overall), specificity, PPV, and sensitivity at varying levels of sequence identity for each method using the portion of the test set corresponding to the coherent patterns for that method. Bold values indicate the top three performing methods for that row. The sequence-based methods are expected to do very well since they have the advantage of abundant sequence data for modeling functional families. SeqFEATURE performs relatively less well in sensitivity and specificity, but is still very competitive, especially with regards to positive predictive value and false negative sensitivity. At lower sequence identities, Gene3D outperforms all other methods at sequence identities greater than 30%, but SeqFEATURE remains robust throughout and, in particular, shows better performance than Gene3D and Pfam when sequence identity is less than 30%. SeqFEATURE is the best performing method by far when the proteins have less than 25% sequence identity to the training set.
sponding PROSITE patterns, many models correctly classified all of the true positives and some of the false negatives, and had fewer false positive predictions than the pattern. Even when a model failed to recapitulate every PROSITE true positive, it often correctly predicted proteins that the PROSITE pattern missed.
Furthermore, we show that although SeqFEATURE demonstrates slightly lesser performance than the sequence-based methods overall, it exhibits useful performance trends as sequence identity to proteins of known function decreases. SeqFEATURE, and perhaps structure-based methods in general, should be most valuable in these scenarios, since they sense three-dimensional atomic environments rather than the sequences that fold to create those environments. We observe that this advantage is strongest when the sequence identity is less than 30%, which is well-documented as the 'twilight zone' of sequence analysis [37].
When we further investigate this region of low identity, we see that SSM and 3D templates do not perform as well as SeqFEATURE on the low structural similarity test set. SSM is essentially a fold-matching algorithm, and at low structural similarities the folds of the test structures likely differed significantly from those folds most representative of proteins with the function in question. Theoretically, the 3D template method is more similar to SeqFEATURE, but in reality it performed similarly to SSM. It is possible that the residue triads that 3D templates detect were dependent on exact conservation of sequence features. In contrast, SeqFEATURE was less affected by the reduction in structural similarity because it depends less on specific sequences or arrangements of residues, and instead incorporates abstract physical and chemical properties in a locally defined region.
Determining how different methods compare in predicting function is a challenging task, and so neither our procedure for comparing methods nor the interpretation of the comparison's results is straightforward. Function itself is broadly defined and does not lend itself easily to straightforward or computable classification schemes. Many classifications are applicable only to specific types of functions and can differ in the scope of their descriptions, ranging from whole domain labels on sequence (for example, Pfam) to exact locations in structures (for example, SeqFEATURE or 3D templates).
Responding to this diversity in description and classification, we made several choices in our comparison of sequence and structure-based methods, each of which carries a certain amount of bias.
In comparing Pfam, HMMPanther, and Gene3D to other methods, for example, we restricted the evaluation to those functions (PROSITE patterns, specifically) whose SeqFEA-TURE positive training sets mapped unambiguously to the corresponding database assignment. This may have artificially boosted performance of the sequence-based methods, We calculated the sensitivity, specificity, and PPV for SeqFEATURE, 3D templates, and SSM using a random subset of 29 positive and 15 negative structures. Additionally, we calculated sensitivity for each method on a low structural similarity subset ('LSS-sensitivity') of the positive test set. Bold values indicate the top two performing methods for that row. SeqFEATURE performs relatively less well overall, but when structural similarity is reduced, SeqFEATURE again is the best performing method. The different types of input used to train each method also have some implications, an important one being that sequence-based methods currently have much more data available to them than structure-based ones. Although this means that the best sequence-based methods currently outperform structure-based methods on our unfiltered PROSITE-based test sets, it does not diminish the need for or value of structure-based methods. Such methods are useful precisely when sequence identity to known proteins is low, as shown in our results on low sequence identity test sets and our analyses on interesting TargetDB predictions.
The two structure-based methods compared here contain an analogous advantage, however, in that they match the query against the entire repository of known protein structures. Thus, if the query has very similar structures (for example, the same protein from different species) in the PDB, SSM and 3D template searches will very likely result in a high confidence hit to these structures. In cases where the query structure is completely novel, however, SSM and 3D templates are expected to do less well, as suggested by their performance on the low structural similarity test set. SeqFEATURE, on the other hand, because it does not rely on exactly conserved geometries or structural motifs, continues to show robust performance even when the structure does not share significant similarity to known proteins.
Another potential bias may come from limiting the structurebased comparisons to those patterns associated with EC numbers. In order to determine the correctness of predictions from SSM and 3D templates, we required a precise functional classification system. SCOP is a potential alternative evaluation method, but SCOP is a structural classification that does not always map directly to function, so we chose to use EC numbers. This, of course, means that the results of the comparisons may not be representative of how each method performs on non-enzymatic functions. The use of EC numbers is also affected by how accurately and completely the PDB is annotated and by the granularity of function assigned. Several of the test structures on which 3D templates and SSM performed poorly had matches to proteins annotated with only slightly different EC numbers. Thus, 3D templates and SSM should still be considered valuable tools for gaining insight into the potential function of an uncharacterized protein.
Although the set of patterns and the resulting test sets used here are by no means fully representative or without bias, they enabled us to map our SeqFEATURE models directly to test sets, a non-trivial endeavor given the inconsistency and variety of existing function classifications. It also allowed us to look specifically at where SeqFEATURE improves on or fares worse than the sequence patterns that generated the models. We often chose test sets with biases against our method in order to assess its operating characteristics accurately; for example, our use of one-dimensional sequence Sensitivity trends of SeqFEATURE, Gene3D, Pfam, and HMMPanther at low sequence identities Figure 5 Sensitivity trends of SeqFEATURE, Gene3D, Pfam, and HMMPanther at low sequence identities. We compared the sensitivity of SeqFEATURE at three specificity cutoffs against the sensitivity of Gene3D, Pfam, and HMMPanther on test sets filtered for low sequence identity. We evaluated each method on the subset of the original test set that had less than the specified sequence identity to the training sets. As sequence identity decreases, the sequence-based methods show a clear trend towards lower sensitivity. In contrast, SeqFEATURE at all three cutoffs shows no such downward trend, indicating robust detection of function even when sequence identity is very low. predicted by SeqFEATURE. Note the presence of two histidine residues (one can be seen clearly above each site) and a number of negative charges distributed throughout both local environments. Note also the similarity in secondary structure. (b) The local structure of 1FI6 (left), which contains a known EF hand calcium-binding motif, is compared to SeqFEATURE's predicted calcium-binding site in 2OGF. Note the similar distribution of negative charges and closely matching loop structures. The calcium is visible as a brown sphere in 1FI6, surrounded by oxygen atoms. (c) 1K8U is another known EF hand containing protein, shown to the left of the uncharacterized protein structure 2OX6, for which SeqFEATURE predicts calcium-binding. These figures were created using VMD [43]. patterns as the gold standard provides a strong advantage to sequence-based methods. Restricting the comparison to patterns that mapped coherently to Pfam, Gene3D, and HMMPanther families may also predispose those methods to good performance. SeqFEATURE exhibited good performance despite these biases.
Because SeqFEATURE also focuses on the local microenvironment around functional sites, it can detect function at finer detail than fold-matching algorithms such as SSM.
Because it considers both atom-based and physichochemical properties in addition to residue-based ones, it is also capable of generalizing function away from sequence and may be able to detect functional similarities that have converged from different ancestors or that use slightly different residues and a different overall fold to accomplish similar activities. This capability is demonstrated by the fact that SeqFEATURE detects many of the positive examples that the PROSITE pat- tern misses. The ability to abstract the properties relevant to function independent of sequence or structural homology is one of SeqFEATURE's biggest strengths.

Overview of the SeqFEATURE pipeline
Another one of SeqFEATURE's advantages is that score cutoffs can be adjusted to reflect the user's desired performance criteria, for example, estimated specificity, sensitivity, or positive predictive value. The ratio of true positives to false positives and false negatives is traded off depending on where the score cutoff is set. There are several additional filters one can use to boost the confidence of positive predictions. True hits often manifest themselves as a cluster of high-scoring positive predictions for the same or related functional site models.
Single, isolated hits in a protein, although potentially interesting, may not have the exact function represented by the model.
The functional 'fingerprint' of each model (as shown in Figure  7d) also allows detailed understanding of the physicochemical environment representative of that type of functional site, and detailed inspection of potential positives may boost confidence of positive predictions or help explain the existence of any false positives. Even if the SeqFEATURE prediction is not entirely accurate, the fact that it is based on a representation of the local physical and chemical environment means that we can still make interesting observations about what properties Table 6 Physicochemical properties used by the FEATURE algorithm

Atom-based
Molecule These properties are expressed at the atomic, molecular, residue and secondary structural levels of abstraction. The properties at the atomic, molecule and secondary structural level are designed to make the FEATURE models relatively less dependent on primary amino acid sequence, in an attempt to improve performance on highly divergent (or convergent) sites. Most of these properties are simply counts of the property, and a few are continuous valued, as discussed in [32].
helped the site score highly, and which additional properties may be necessary for the site truly to contain the predicted function.
Most importantly, since SeqFEATURE is not dependent on sequence or overall structural fold, it can be used when either the sequence or the structure is novel. This became evident when we compared the performance of the different methods at low sequence identities and low structural similarities, and found that SeqFEATURE shows a trend to being more sensitive than sequence-based methods at low sequence identities and more sensitive than other structure-based ones at low structural similarities. As shown with the three TargetDB examples, SeqFEATURE is able to predict function where other methods are not. Further inspection of the putative sites reveals compelling evidence for SeqFEATURE's predictions. The ability to provide useful predictions on novel structures will become more and more important as structural genomics matures, and SeqFEATURE demonstrates robust performance in this area.

Conclusion
Advances in protein structure determination have led to an increase in unannotated structures, many with low sequence identity to known proteins. Our method, SeqFEATURE, uses functional sequence motifs to seed training sets from which three-dimensional models of the function are built. We used SeqFEATURE to construct a large library of three-dimensional functional site models from PROSITE motifs and scanned uncharacterized structural genomics targets from TargetDB for function. SeqFEATURE's descriptive and intuitive models show comparable performance to existing sequence-and structure-based methods. Importantly, SeqFEATURE models retain robust performance when sequence identity and structural similarity are reduced. Methods such as SeqFEATURE that do not rely on strict sequence or structure conservation will be valuable tools for annotating novel protein structures.

Materials and methods
SeqFEATURE ( Figure 7) is a method for automatically selecting training sets and building structural models within the FEATURE framework, a system for modeling functional sites in protein structures that has been published previously [32].
Here, we summarize the FEATURE algorithm and present SeqFEATURE in more detail.

The FEATURE algorithm
FEATURE builds statistical three-dimensional models of the local environment around a functional site given training sets of positive and negative examples. These models can then be used to evaluate test sites and predict whether they have particular functions. FEATURE calculates a number of physicochemical properties (Table 6) at varying radial distances from the site center and creates a feature vector containing the values of each property in each radial volume (Figure 7c). Both atomic and residue level properties are examined, allowing the functional site to be described at multiple levels. The structural model is constructed by comparing the statistical distribution of properties between positive and negative sites.
In a model for a particular functional site, properties are described as either significantly more present, more absent, or having no significant difference in the positive sites compared to negative sites using the Wilcoxan rank sum test [32]. The significant properties can be visually displayed as a colorcoded matrix unique to that model, which we call its fingerprint (Figure 7d).
Using a naïve Bayes scoring function, FEATURE can then evaluate the likelihood that a query site contains the function described by a particular model. A feature vector is created for the query site in the same way as for the training sites, and a likelihood score is calculated assuming independence of each individual feature υ i : A score cutoff for classifying a query site can be chosen for each model according to the user's desired performance criteria ( Figure 7e).

Training set selection
SeqFEATURE adds to the FEATURE framework by using one-dimensional sequence motifs as seeds for generating training sets of structural examples (Figure 7a). This method was first introduced by Liang et al. [34] in a single application to calcium binding by EF-hand motifs, and is extended and applied here into a full library of functional site models. To build the library of models, we extracted structural examples of PROSITE functional site patterns from the ASTRAL40 compendium [38], which is a nonredundant subset of protein domains in the PDB. We required training sets to have a minimum of five structural examples.
PROSITE patterns are regular expressions that specify the amino acids permitted at each position of the motif. We defined functional site centers to be the functional atom(s) of annotated functional residues in each pattern, for example, the gamma oxygen of serine, or SER.OG. For patterns with multiple functional residues or multiple functional atoms, we built multiple models for the same PROSITE pattern. For example, the PROSITE pattern EGF_1 has functional cysteine residues at positions 1, 3, and 7, so there are three models centered at three atoms in this pattern -EGF_1. 1 the motif, the residue at that position, and the atom within that residue upon which the model is centered. See Table 1 for a complete list of SeqFEATURE models.
Positive training sets consist of PDB coordinates of functional atoms as described above, extracted from structures containing that particular pattern. We selected negative training sets randomly from identical residues in the rest of the PDB whose atom compositions and densities are similar to the positive sites. In order to define the background distribution of the functional site environments, we used a thousand times as many negative sites as positive sites for each model, when possible, but never less than 4,000.

Model cross-validation and evaluation
We internally evaluated each model using five-fold cross validation by partitioning the positive and negative training sets randomly into five blocks. For each run, we used four blocks to build the model and tested performance on the remaining block. To compare results across runs, we transformed the scores into Z-scores by standardizing to the mean and standard deviation of the negative score distribution.
To measure performance, we employ ROC curves, which plot the true positive rate (sensitivity, or the ratio of true positive predictions to all true positives) against the false positive rate (1-specificity, or the ratio of false positive predictions to all true positives) at varying Z-score cutoffs. We also use PPV versus sensitivity to gauge the performance of a model. Sensitivity, specificity, and PPV are calculated as follows: The AUC estimates the probability that a random positive site will be scored higher than a random negative site, and provides a summary measure of the performance of the model. The final models used all of the training examples, and include score cutoffs calculated for 95%, 99%, and 100% specificity based on cross validation data.

Comparison to other function prediction methods
The manually curated PROSITE record for each pattern contains known true positives, false positives, and false negatives predicted by that pattern, listed using Swiss-Prot identifiers. We treated each Swiss-Prot ID as a unique protein. Taking existing mappings between Swiss-Prot and the PDB, we also converted each list into a list of corresponding PDB structures to use as input to SeqFEATURE and other structure-based prediction methods. Thus, our positive test set consisted of Swiss-Prot IDs and PDB structures for proteins annotated as true positives and false negatives in PROSITE, and our negative test set consisted of Swiss-Prot IDs and PDB structures for proteins annotated as false positives. We removed all positive training set structures from the test sets and filtered the test structures to ensure that they contained the functional regions described by the appropriate PROSITE pattern.
Using these test sets, we compared performance among PROSITE, Pfam, Gene3D, HMMPanther, SSM, 3D templates (reverse template type), and SeqFEATURE. In order to ensure consistency across the comparisons, we restricted the analysis to patterns that had at least one model with an AUC >0.75 and that also mapped unambiguously to the pattern database or tool being compared. For example, to determine the Pfam assignment for a particular pattern, we looked up the set of Pfam assignments for each structure in the training set using Pfam's publicly available mappings. Unambiguous assignments were those for which either 100% of the training set mapped to the same Pfam family, or for which the Pfam family clearly matched the PROSITE pattern (for example, PROSITE pattern GLYCOSYL_HYDROL_F10 and Pfam family 'Glyco_hydro_10'). Forty-two PROSITE motifs had both an AUC >0.75 and a positive test set independent of the training set (TRYPSIN_HIS was excluded due to it being nearly identical to TRYPSIN_SER), and, of these, 31 mapped unambiguously to Pfam, 12 to Panther, and 29 to Gene3D.
Because structure-based methods such as 3D templates and SSM are more computationally expensive to run than SeqFEATURE and the sequence-based methods, we split the comparison into two parts. The first part compared PROSITE, Pfam, HMMPanther, Gene3D, and SeqFEATURE, and covered the unambiguous portions of the test sets in their entirety. PROSITE's predictions came directly from its annotations. For the other sequence-based methods, we analyzed the test set proteins using each tool and marked a protein as a positive prediction if at least one of its mapped predictions matched the unambiguous assignment for the pattern being tested. HMMPanther and Gene3D were run from the Inter-Pro servers using the stand-alone downloadable Perl client [39]. Pfam's predictions were taken directly from their publicly available mapping file. For SeqFEATURE, we classified a protein as positive if at least one of its mapped PDB structures scored above the specified cutoff for at least one model derived from that pattern. Since SeqFEATURE cutoffs are variable, we tested performance at 95%, 99%, and 100% specificity cutoffs.
To compare SSM, 3D templates, and SeqFEATURE, we limited our test sites to those derived from PROSITE patterns that mapped to EC numbers. Since 3D templates (reverse template type) and SSM both return protein structures rather than a named function as output, we used EC numbers to evaluate predictions made by SSM and 3D templates. We determined the set of EC numbers corresponding to each pat-Sensitivity # of true positive predictions total # of true = p positives Specificity # of true negative predictions total # of true = n negatives Positive predictive value # of true positive predictions to = t tal # of positive predictions