PHOSIDA (phosphorylation site database): management, structural and evolutionary investigation, and prediction of phosphosites
© Gnad et al.; licensee BioMed Central Ltd. 2007
Received: 29 June 2007
Accepted: 26 November 2007
Published: 26 November 2007
PHOSIDA (http://www.phosida.com), a phosphorylation site database, integrates thousands of high-confidence in vivo phosphosites identified by mass spectrometry-based proteomics in various species. For each phosphosite, PHOSIDA lists matching kinase motifs, predicted secondary structures, conservation patterns, and its dynamic regulation upon stimulus. Using support vector machines, PHOSIDA also predicts phosphosites.
Protein phosphorylation is a ubiquitous and important post-translational modification, responsible for modulating protein function, localization, interaction and stability [1–4]. High-throughput experimental studies, such as our recent large-scale analysis of the human phosphoproteome by quantitative mass spectrometry, in which we measured the time courses of more than 6,600 phosphorylation sites in response to growth factor stimulation, enable us to study biological systems from a global perspective. Those sites were identified by high-resolution mass spectrometry with an estimated false positive rate of less than one percent and constitute an unbiased, in-depth sampling of the in vivo phosphoproteome. In addition, PHOSIDA includes large-scale phosphoproteomes from various eukaryotic and prokaryotic organisms, such as Bacillus subtilis and Escherichia coli, providing information about the evolution of phosphorylation events in the cell.
We developed PHOSIDA to retrieve and analyze phosphosites from large-scale, high-confidence quantitative phosphoproteomics experiments, which usually study the response of biological systems to various stimuli through time course data. It is thus the first phosphosite database to explicitly store quantitative data on the relative level of phosphorylation. PHOSIDA also matches kinase motifs to phosphosites. A challenge in mass spectrometry-based phosphosite mapping is that the measurement yields phosphopeptides, which must then be mapped to one or more corresponding protein sequences. PHOSIDA addresses this problem with a many-to-many mapping between phosphopeptide sequences and protein entries in the sequence database. One of the fundamental strengths of PHOSIDA lies in the high quality and the very large size of its in vivo data sets.
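The many-to-many mapping between phosphopeptides and protein entries can be sketched as follows. This is a minimal illustration under stated assumptions, not PHOSIDA's implementation: the function name `map_peptides_to_proteins`, the accessions, and the toy sequences are hypothetical, and phosphosite offsets are assumed to be given as 0-based positions within each peptide.

```python
from collections import defaultdict


def map_peptides_to_proteins(peptides, proteins):
    """Map each phosphopeptide to every protein that contains it.

    peptides: dict peptide sequence -> list of 0-based phosphosite offsets
              within the peptide (hypothetical input layout).
    proteins: dict accession -> protein sequence.
    Returns (peptide_to_sites, protein_to_peptides); site positions are
    1-based protein coordinates.
    """
    peptide_to_sites = defaultdict(list)
    protein_to_peptides = defaultdict(set)
    for peptide, offsets in peptides.items():
        for accession, sequence in proteins.items():
            start = sequence.find(peptide)
            while start != -1:  # a peptide may occur more than once
                for offset in offsets:
                    peptide_to_sites[peptide].append((accession, start + offset + 1))
                protein_to_peptides[accession].add(peptide)
                start = sequence.find(peptide, start + 1)
    return dict(peptide_to_sites), dict(protein_to_peptides)


# Toy data: two entries share one peptide.
proteins = {
    "PROT_A": "MKTAYSPEKLLDR",
    "PROT_B": "MGGAYSPEKQQ",
}
peptides = {"AYSPEK": [2]}  # the serine at offset 2 is phosphorylated
sites, by_protein = map_peptides_to_proteins(peptides, proteins)
```

Keeping both directions of the mapping means a shared peptide is reported on every protein entry it matches instead of being assigned arbitrarily to one of them.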
In this paper we describe the features and capabilities of PHOSIDA. We also use the analysis tools in PHOSIDA to investigate the structure and evolution of the phosphoproteome from a global point of view. Recent studies have found support for the hypothesis that protein phosphorylation occurs predominantly within regions without regular structure [7, 8]. This was also the conclusion of a recent paper describing MitoCheck (mtcPTM), a recently established database of human and mouse phosphorylation sites. Its authors used known structures and homology modeling to determine the structural constraints of phosphorylation sites. Here we investigate and quantify this observation on a very large in vivo dataset. The resulting secondary structure and accessibility information for each phosphosite is available in PHOSIDA.
Although conservation of specific sites is often taken to imply biological importance, relatively little is known about the evolutionary constraints on the phosphoproteome. We investigated these constraints on three levels: conservation of phosphoproteins, regions surrounding the site and the phosphosite itself. Consequently, PHOSIDA provides the evolutionary conservation of each phosphosite at these three levels.
In addition, we took advantage of the large number of in vivo phosphosites to create a phosphosite predictor in PHOSIDA. There have been various machine learning approaches to predict phosphorylation sites. For example, the prediction system Netphos is based on neural networks, whereas Scansite uses a profile method to predict phosphorylation events. We use our large-scale studies to construct a phosphorylation site predictor on the basis of a support vector machine (see the introduction by Noble in the reference list). Support vector machines (SVMs) have been applied to a large variety of fields ranging from internet fraud to topics in molecular biology, such as classification of gene expression profiles, and there has already been one study that applied SVM techniques to predict phosphorylation sites. However, that approach was exclusively based on the primary sequences of around 1,000 phosphorylation sites. Here we construct a predictor based on more than 5,000 high-confidence phosphosites. We also show that information about the structure and conservation of phosphorylation sites slightly increases the performance of the predictor.
Furthermore, PHOSIDA can search for motifs of interest in any input sequence. These motifs can be user generated or drawn from already annotated kinase motifs.
Database management of phosphorylation sites
As mentioned above, PHOSIDA was first developed to facilitate retrieval and analysis of high-confidence phospho-datasets generated in our group. For example, PHOSIDA contains a large number of phosphorylation sites from human cell lines exposed to growth factor stimulation. Protein assignments are based on the IPI database, which PHOSIDA cross-references with the Swissprot database. Entries of both databases that correspond to the same protein were aligned to derive the exact positions of protein features such as domains, active sites, motifs, and binding sites. Already annotated phosphosites derived from Swissprot are transferred to the IPI sequences in the same way. The aligned regions can be visualized via 'check alignment' buttons. Phosphoproteome data generated by the community will be regularly imported into PHOSIDA in this way rather than by individual import of specific projects. PHOSIDA will be updated with sites annotated in Swissprot at least every 6 months, or as soon as substantial new large-scale studies on phosphorylation are included in Swissprot. In the case of prokaryotic phosphorylation sites, the protein assignment was exclusively based on the TIGR database.
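Transferring an annotated position from one database entry to the other through a pairwise alignment can be sketched as below. This is an illustration only: the helper `map_positions` and the toy alignment rows are hypothetical, and the alignment itself is assumed to have been computed beforehand by some alignment program.

```python
def map_positions(aligned_a, aligned_b):
    """Return a dict mapping 1-based positions in sequence A to sequence B.

    aligned_a and aligned_b are the two rows of a pairwise alignment,
    equal in length, with '-' marking gaps. Positions of A that face a
    gap in B are left out of the mapping.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    mapping = {}
    pos_a = pos_b = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a != "-":
            pos_a += 1
        if res_b != "-":
            pos_b += 1
        if res_a != "-" and res_b != "-":
            mapping[pos_a] = pos_b
    return mapping


# Toy alignment: the second entry carries two extra N-terminal residues.
swissprot_row = "--MKTAYSPEK"
ipi_row = "GSMKTAYSPEK"
to_ipi = map_positions(swissprot_row, ipi_row)
annotated_site = 6  # serine at position 6 of the first entry
transferred = to_ipi[annotated_site]
```

Positions that fall opposite a gap have no counterpart and are simply absent from the mapping, so a feature in an unaligned stretch is not transferred to a wrong residue.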
Structural investigation of the phosphoproteome
We next wanted to confirm the generality of this observation, namely that phosphosites lie predominantly in accessible regions without regular structure, for phosphoproteins with a solved structure. We identified the proteins in our human phosphoset that had a structure in the Protein Data Bank and mapped our in vivo phosphorylation sites to their three-dimensional coordinates. Secondary structures were assigned by DSSP, a program that assigns secondary structures to given three-dimensional atomic coordinates of proteins. In total, we assigned 26 phosphogroups to 16 structures of different proteins (Additional data file 2). As is apparent from the structures, the phosphogroups are always located in highly accessible parts of the proteins. Furthermore, in all but one case the phosphogroups are found in flexible parts of the structure (hinges or loops). In 12 cases the structure around the phosphosite was so flexible that it had not been determined at all (Additional data file 3).
Evolutionary conservation of the phosphoproteome
We next wished to integrate another dimension of biological information on the phosphoproteome into PHOSIDA, namely its evolutionary conservation. We determined homologous proteins to all phosphoproteins across 70 species from E. coli to mouse via BLASTP. The homology search was performed against protein databases of 53 bacteria, nine archaea, and eight eukaryotes. These databases were retrieved from Swissprot in the case of Archaea and Bacteria. The yeast proteome was downloaded from SGD, Drosophila melanogaster from FlyBase and the other eukaryotic sequences from IPI. We defined proteins to be homologous when the resulting E-values were lower than 10^-5. For homologous proteins, we used a bidirectional BLASTP approach to distinguish between paralogs and orthologs.
PHOSIDA displays the results of the homology searches using an approximate phylogeny of all investigated species. Taxonomic divisions are displayed on-screen when the cursor is pointed at the phylogenetic tree. If the selected phosphoprotein is not homologous to any protein of a certain organism, that organism is highlighted in red. If the similarity between the sequence of the phosphoprotein and its homologous protein was the significantly best one in both directions, the given organism is highlighted in green. A higher similarity between the sequence of the homologous protein and another protein of the organism of the selected phosphoprotein suggests paralogy, which is indicated in blue.
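The bidirectional classification and its color coding can be sketched roughly as follows. The function `classify_homology` and its input layout are hypothetical. The sketch also simplifies the criterion: it takes the plain best hit in each direction, whereas the text requires the best hit to be significantly better than the alternatives, and it treats every remaining significant hit as a putative paralog.

```python
E_VALUE_CUTOFF = 1e-5  # homology threshold stated in the text


def classify_homology(query, forward_hits, reverse_hits):
    """Classify a query protein's relationship to one target organism.

    forward_hits: list of (target_protein, e_value) from searching the
                  query against the target proteome.
    reverse_hits: dict target_protein -> list of (protein, e_value) from
                  searching that target back against the query's proteome.
    Returns 'none', 'ortholog', or 'paralog'.
    """
    significant = [(t, e) for t, e in forward_hits if e < E_VALUE_CUTOFF]
    if not significant:
        return "none"
    best_target = min(significant, key=lambda hit: hit[1])[0]
    back = [(p, e) for p, e in reverse_hits.get(best_target, [])
            if e < E_VALUE_CUTOFF]
    if back and min(back, key=lambda hit: hit[1])[0] == query:
        return "ortholog"  # best hit in both directions
    return "paralog"       # the target's best match is another protein


# Display colors as described in the text.
COLOR = {"none": "red", "ortholog": "green", "paralog": "blue"}
```

For example, a query whose best target hit points back to a different protein of the query organism would be classified as a paralog and shown in blue.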
On the basis of these global alignments for orthologous phosphoproteins, we found that regions containing phosphorylation sites showed lower conservation than the average conservation of the entire protein. As seen in Additional data file 4, the average identity in the 40 amino acid window surrounding the aligned phosphorylation sites is lower for each eukaryotic species compared to the entire protein identity. This effect is most pronounced for serine and threonine due to their almost exclusive location in fast evolving loop and hinge regions.
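The comparison between the identity of the region around a site and the identity of the whole protein can be illustrated with a short sketch. The function names and the toy alignment are hypothetical, and the window is measured here in alignment columns, which is a simplification of a window defined in amino acids.

```python
def percent_identity(row_a, row_b):
    """Percent of alignment columns in which both rows hold the same residue.

    Columns with a gap in both rows are skipped; columns with a gap in
    only one row count as mismatches.
    """
    columns = [(a, b) for a, b in zip(row_a, row_b)
               if not (a == "-" and b == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)


def window_identity(row_a, row_b, site_column, flank=20):
    """Identity over columns site_column - flank .. site_column + flank.

    site_column is a 0-based alignment column; the window is clipped at
    the ends of the alignment.
    """
    lo = max(0, site_column - flank)
    hi = min(len(row_a), site_column + flank + 1)
    return percent_identity(row_a[lo:hi], row_b[lo:hi])


# Toy alignment with the aligned phosphoserine in column 4.
row_a = "AAAASAAAA"
row_b = "AAAGSCAAA"
whole = percent_identity(row_a, row_b)
local = window_identity(row_a, row_b, site_column=4, flank=1)
```

In this toy case the site itself is conserved while its immediate neighbors are not, so the local identity is lower than the identity of the whole alignment.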
These data suggest that the surrounding sequence regions may diverge to such an extent that the structural effect (fast sequence evolution) could compete with the constraining pressure of function (slow sequence evolution). In order to correctly assess the degree of conservation of phosphosites, it is therefore important to take the structural effect - fast evolution of loop regions - into account. We did this by choosing only sites located in loop regions for the comparison set, which should isolate the functional, evolutionary constraints on the phosphosite itself.
Prediction of phosphorylation sites using support vector machines
We investigated several common kernel functions and found the radial basis function (RBF) to be more powerful than linear, polynomial and sigmoid kernel functions. We optimized the parameters C and σ, the width of the Gaussians used as the RBFs, and trained the optimal model for each set of each phospho amino acid separately (Additional data file 5).
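To make the role of σ concrete, the sketch below evaluates a Gaussian RBF kernel on two encoded sequence windows. Several things here are assumptions rather than details from the text: the one-hot encoding, the five-residue toy windows, the function names, and the parametrization exp(-||x - y||^2 / (2σ^2)), which is one common convention for a Gaussian of width σ.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_window(window):
    """Encode a sequence window as a flat one-hot vector (20 entries per residue).

    Characters outside the 20 standard amino acids (for example a padding
    symbol at protein termini) are encoded as all zeros.
    """
    vector = []
    for residue in window:
        vector.extend(1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS)
    return vector


def rbf_kernel(x, y, sigma):
    """Gaussian RBF kernel: exp(-||x - y||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


# Two toy windows that differ at the central residue only.
same = rbf_kernel(one_hot_window("RRASV"), one_hot_window("RRASV"), sigma=1.0)
near = rbf_kernel(one_hot_window("RRASV"), one_hot_window("RRATV"), sigma=1.0)
```

A larger σ makes the kernel value decay more slowly with distance, so more dissimilar windows still influence each other, while C controls how strongly training errors are penalized.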
We found that the accuracy of the prediction based on the primary sequence was already very high: in the case of phosphoserines, 89.85% were predicted correctly in the test set, as were 74.24% of the phosphothreonines (Additional data file 6). The accuracy of the prediction increased to 90.17% for pS and 77.27% for pT by adding structural information (sets b and c). For serines, the accessibility was slightly more important than the secondary structure information, whereas for threonines, the opposite was the case. The additional dimensions reflecting the conservation of the site and of the entire protein (set d) increased the accuracy to 90.70% (pS) and 81.06% (pT). By combining structural and evolutionary information (set e), we found that 91.75% in the serine set and 81.06% in the threonine set were predicted correctly. The accuracy of the prediction of phosphotyrosines increased from 66.67% to 76.19% when including the structural and conservational information. However, that increase is not significant because there were only around 100 phosphotyrosine sites.
PHOSIDA includes the prediction of phosphorylated serines and threonines on any input sequence on the basis of the SVM, which was trained on the basis of raw sequences. Users can set a certain cutoff directly on the precision-recall-curve for the prediction. Sites that are predicted to be phosphorylated are automatically matched to annotated kinase motifs.
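Choosing a cutoff from a precision-recall trade-off can be sketched as follows. The scores and labels are invented for illustration, and the helper names are hypothetical; in practice the scores would be classifier decision values on a labeled evaluation set.

```python
def precision_recall_at(scores, labels, cutoff):
    """Precision and recall when sites scoring >= cutoff are called phosphorylated."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def cutoff_for_precision(scores, labels, target_precision):
    """Lowest observed score cutoff whose precision meets the target, or None."""
    for cutoff in sorted(set(scores)):
        precision, _ = precision_recall_at(scores, labels, cutoff)
        if precision >= target_precision:
            return cutoff
    return None


# Toy decision values and true labels (1 = phosphorylated).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
strict = cutoff_for_precision(scores, labels, 0.9)
relaxed = cutoff_for_precision(scores, labels, 0.75)
```

Because recall can only fall as the cutoff rises, returning the lowest cutoff that satisfies the precision target keeps as many true sites as possible at that precision level.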
In addition to the prediction, we also integrated a simple tool that searches for matching kinase motifs on any sequence of interest. Alternatively, users can define their own motif and derive matching sites of the given sequence.
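A motif search of this kind can be approximated with regular expressions, as in the sketch below. The three patterns are widely cited consensus motifs chosen for illustration; the motif collection actually used by PHOSIDA is not listed in this text, and the function name and dictionary layout are hypothetical.

```python
import re

# Illustrative consensus patterns written as regular expressions; the
# named group 'site' marks the residue that would be phosphorylated.
KINASE_MOTIFS = {
    "PKA-like (R-R-x-S/T)": r"RR.(?P<site>[ST])",
    "CK2-like (S/T-x-x-E)": r"(?P<site>[ST])..E",
    "Proline-directed (S/T-P)": r"(?P<site>[ST])P",
}


def find_motif_sites(sequence, motifs=KINASE_MOTIFS):
    """Return sorted (1-based site position, motif name) pairs for every match.

    Each pattern is wrapped in a lookahead so that overlapping matches
    are all reported.
    """
    hits = []
    for name, pattern in motifs.items():
        for match in re.finditer(f"(?=({pattern}))", sequence):
            hits.append((match.start("site") + 1, name))
    return sorted(hits)


# Annotated motifs on a toy sequence, then a user-defined motif.
annotated_hits = find_motif_sites("MRRASVSPEE")
custom_hits = find_motif_sites("AASPK", {"custom": r"(?P<site>S)PK"})
```

A user-defined motif only needs to supply a pattern containing a `site` group, so the same function serves both the annotated and the custom case.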
Outlook and conclusion
The concept of a phosphorylation site database is, of course, not a novel one. PhosphoSite and Phospho.ELM are already comprehensive databases that contain phosphorylation sites from different projects. The aim of PHOSIDA is to include very high quality input data as well as quantitative information such as regulation after stimuli. Additionally, we take into account structures and evolutionary data across a variety of species, in order to integrate biological context into the database and to quantify constraints of phosphorylation on a proteome-wide scale. Thus, PHOSIDA provides a rich environment to the biologist wishing to analyze phosphorylation events of proteins of interest.
Our analysis of a large and unbiased set of in vivo phosphorylation sites in human cells shows that phosphorylation events are not distributed along the whole protein structure but are instead constrained to sites of high accessibility and structural flexibility. Particularly in the case of serine and threonine, phosphorylation is almost completely restricted to loops and hinges. Tyrosine is found to some degree in regular secondary structure elements, but phosphotyrosines are very likely to be in flexible regions as well. Mechanistically, localization of phosphorylation in flexible regions of the protein is advantageous because it gives the kinase access to the substrate, which needs to be positioned in the active site. Furthermore, the functional consequences of phosphorylation in many cases also depend on the flexibility of the phosphorylated sequence, such as when loops are repositioned after phosphorylation or when the phosphorylated loop participates in a protein-protein interaction. However, it is important to emphasize that the structural analysis was based on predictive methods rather than experimental data. Nevertheless, it stands to reason that the large size of the dataset should compensate for statistical errors caused by the prediction algorithm. Furthermore, as mentioned above, the MitoCheck database (mtcPTM) study came to similar conclusions about the structural constraints of phosphosites. This is gratifying because those authors used a different set of phosphorylation sites (gathered in the European Union consortium MitoCheck) and a different method (homology modeling) to determine preferential phosphorylation on different secondary structure elements. The authors also noted that phosphorylation sites can accumulate at the flanks of structured domains and, in some cases, on buried residues. Interestingly, phosphorylation of these sites could destabilize part of the protein structure and, for example, allow or disallow protein-protein interactions.
The concordant results on structural constraints of phosphorylation sites between the MitoCheck study and this study also implicitly validate our use of the SABLE prediction tool for secondary structure and solvent accessibility prediction in this context. Here we have used these predictions to extend the feature space used in phosphorylation site prediction.
The analysis of the evolutionary sections of PHOSIDA shows that the number of orthologs of the human phosphoproteome is much higher than that of the entire human proteome, at least when analyzing the phosphoproteins identified by Olsen et al. This probably reflects important and conserved functional roles of proteins with this post-translational modification. As a consequence of the location of phosphorylation sites in loops and hinges, the sequence regions around phosphorylation sites evolve faster than the rest of the protein. In practice, this makes it difficult to align phosphosites correctly in orthologous proteins, a difficulty that can be overcome by combining fast, word-based algorithms (BLASTP) to find candidates with exhaustive algorithms (Needle) to properly align phosphorylation sites.
Our analysis of the global alignments of orthologs in eukaryotes shows that phosphorylation sites are more conserved than non-phosphosites of the same proteins. However, for any given site the sequence identity is already very high, for example, more than 70% for serine and threonine in mammals. For tyrosine, conservation is even higher. Therefore, the mere conservation of a phosphorylation site in mammals or in vertebrates does not necessarily indicate high selection pressure. We found that a region of about five amino acids around the phosphorylation site is more conserved than the surrounding sequence context.
Furthermore, we integrated a tool that matches input sequences with annotated kinase motifs or motifs that are defined by users. In addition, we constructed an SVM-based prediction algorithm for phosphorylation. Training of the SVM on our large-scale dataset led to excellent prediction accuracy. We also showed that the inclusion of structural and evolutionary constraints on the phosphoproteome could slightly increase the performance of the predictor.
The PHOSIDA phosphorylation site predictor makes it possible to find putative novel phosphorylation sites that have not (yet) been experimentally identified. While experimental data, especially quantitative data, are the 'gold standard', predicting novel phosphosites and matching kinase motifs on proteins of interest should be valuable for the design of biological experiments or for predicting a protein's role in a pathway. Furthermore, once predictors are trained, these prediction methods are basically 'free'. We provide an interactive method for setting a desired level of precision and recall. For example, for mutagenesis experiments one may want to set the precision very high, and for rationalizing the function of a protein in a pathway one may want to set it relatively low. Thus, in the absence of experimental data, the prediction of novel phosphosites can serve as the first step in an experimental design aimed at uncovering the function of a protein of interest and elucidating its involvement in certain signaling cascades.
The integration process and analysis pipeline have been automated, so that structural and conservation data for phosphorylation sites from prospective studies can readily be incorporated into PHOSIDA. As new phosphorylation data are integrated into PHOSIDA, our SVM will also be updated, leading to increasingly accurate predictions.
Upcoming projects will investigate the phosphoproteomes of prokaryotes, such as E. coli and Lactococcus lactis, and the dynamics of phosphorylation after various stimuli in B. subtilis and in eukaryotes such as D. melanogaster, mouse, and human.
Additional data files
The following additional data are available with the online version of the paper. Additional data file 1 is a figure showing the accessibilities of phosphorylation sites as calculated by SABLE. Additional data file 2 is a figure showing Protein Data Bank structures of phosphoproteins. Additional data file 3 is a table listing phosphorylation sites located in parts of phosphoproteins that are too flexible for structure determination. Additional data file 4 is a figure that illustrates the conservation of the region surrounding the phosphosite (-20 to +20 amino acids). Additional data file 5 is a table listing the optimal parameters for the SVM prediction. Additional data file 6 is a table listing the prediction accuracies of the SVM approach.
RBF, radial basis function; SVM, support vector machine.
We thank other members of the Department for Proteomics and Signal transduction for sharing insights, especially Marcus Krueger, Sidney Cambridge and Matthias Selbach. We also thank Prof. John Parsch for helpful discussions.
- Hunter T: Signaling - 2000 and beyond. Cell. 2000, 100: 113-127. 10.1016/S0092-8674(00)81688-8.
- Cohen P: The regulation of protein function by multisite phosphorylation - a 25 year update. Trends Biochem Sci. 2000, 25: 596-601. 10.1016/S0968-0004(00)01712-6.
- Pawson T, Nash P: Protein-protein interactions define specificity in signal transduction. Genes Dev. 2000, 14: 1027-1047.
- Schlessinger J: Cell signaling by receptor tyrosine kinases. Cell. 2000, 103: 211-225. 10.1016/S0092-8674(00)00114-8.
- Olsen JV, Blagoev B, Gnad F, Macek B, Kumar C, Mortensen P, Mann M: Global, in vivo, and site-specific phosphorylation dynamics in signaling networks. Cell. 2006, 127: 635-648. 10.1016/j.cell.2006.09.026.
- Macek B, Mijakovic I, Olsen JV, Gnad F, Kumar C, Jensen PR, Mann M: The serine/threonine/tyrosine phosphoproteome of the model bacterium Bacillus subtilis. Mol Cell Proteomics. 2007, 6: 697-707. 10.1074/mcp.M600464-MCP200.
- Iakoucheva LM, Radivojac P, Brown CJ, O'Connor TR, Sikes JG, Obradovic Z, Dunker AK: The importance of intrinsic disorder for protein phosphorylation. Nucleic Acids Res. 2004, 32: 1037-1049. 10.1093/nar/gkh253.
- Dunker AK, Brown CJ, Lawson JD, Iakoucheva LM, Obradovic Z: Intrinsic disorder and protein function. Biochemistry. 2002, 41: 6573-6582. 10.1021/bi012159+.
- Jimenez JL, Hegemann B, Hutchins JR, Peters JM, Durbin R: A systematic comparative and structural analysis of protein phosphorylation sites based on the mtcPTM database. Genome Biol. 2007, 8: R90. 10.1186/gb-2007-8-5-r90.
- Blom N, Gammeltoft S, Brunak S: Sequence and structure-based prediction of eukaryotic protein phosphorylation sites. J Mol Biol. 1999, 294: 1351-1362. 10.1006/jmbi.1999.3310.
- Obenauer JC, Cantley LC, Yaffe MB: Scansite 2.0: Proteome-wide prediction of cell signaling interactions using short sequence motifs. Nucleic Acids Res. 2003, 31: 3635-3641. 10.1093/nar/gkg584.
- Noble WS: What is a support vector machine?. Nat Biotechnol. 2006, 24: 1565-1567. 10.1038/nbt1206-1565.
- Kim JH, Lee J, Oh B, Kimm K, Koh I: Prediction of phosphorylation sites using SVMs. Bioinformatics. 2004, 20: 3179-3184. 10.1093/bioinformatics/bth382.
- Kersey PJ, Duarte J, Williams A, Karavidopoulou Y, Birney E, Apweiler R: The International Protein Index: an integrated database for proteomics experiments. Proteomics. 2004, 4: 1985-1988. 10.1002/pmic.200300721.
- Wagner M, Adamczak R, Porollo A, Meller J: Linear regression models for solvent accessibility prediction in proteins. J Comput Biol. 2005, 12: 355-369. 10.1089/cmb.2005.12.355.
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res. 2000, 28: 235-242. 10.1093/nar/28.1.235.
- Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983, 22: 2577-2637. 10.1002/bip.360221211.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410.
- Swissprot. [http://expasy.org]
- Cherry JM, Adler C, Ball C, Chervitz SA, Dwight SS, Hester ET, Jia Y, Juvik G, Roe T, Schroeder M, et al: SGD: Saccharomyces Genome Database. Nucleic Acids Res. 1998, 26: 73-79. 10.1093/nar/26.1.73.
- Grumbling G, Strelets V: FlyBase: anatomical data, images and queries. Nucleic Acids Res. 2006, 34 (Database issue): D484-488. 10.1093/nar/gkj068.
- O'Brien KP, Remm M, Sonnhammer EL: Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res. 2005, 33 (Database issue): D476-480.
- Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970, 48: 443-453. 10.1016/0022-2836(70)90057-4.
- Rice P, Longden I, Bleasby A: EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 2000, 16: 276-277. 10.1016/S0168-9525(00)02024-2.
- Hornbeck PV, Chabra I, Kornhauser JM, Skrzypek E, Zhang B: PhosphoSite: A bioinformatics resource dedicated to physiological protein phosphorylation. Proteomics. 2004, 4: 1551-1561. 10.1002/pmic.200300772.
- Diella F, Cameron S, Gemund C, Linding R, Via A, Kuster B, Sicheritz-Ponten T, Blom N, Gibson TJ: Phospho.ELM: a database of experimentally verified phosphorylation sites in eukaryotic proteins. BMC Bioinformatics. 2004, 5: 79. 10.1186/1471-2105-5-79.
- Nielsen H, Brunak S, von Heijne G: Machine learning approaches for the prediction of signal peptides and other protein sorting signals. Protein Engineering. 1999, 12: 3-9. 10.1093/protein/12.1.3.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.