Genomic analysis of the eukaryotic protein kinase superfamily: a perspective
Genome Biology volume 4, Article number: 111 (2003)
Protein kinases with a conserved catalytic domain make up one of the largest 'superfamilies' of eukaryotic proteins and play many key roles in biology and disease. Efforts to identify and classify all the members of the eukaryotic protein kinase superfamily have recently culminated in the mining of essentially complete human genome data.
Phosphorylation by protein kinases is recognized as a major mechanism by which virtually every activity of eukaryotic cells is regulated, including proliferation, gene expression, metabolism, motility, membrane transport, and apoptosis. An ultimate goal of research into signal transduction is to reach a full understanding of the protein phosphorylation events that occur within individual cell types and how they eventually impact on cell behavior. A milestone en route to this ambitious goal is a determination of the number of protein kinases encoded by eukaryotic genomes and an assessment of their structures, functions, and evolutionary relationships. This article traces the progress made toward achieving these objectives in the pregenomic and genomic eras, which culminated recently with reports on the 'full complement' of human protein kinases.
The pregenomic era
About sixteen years ago, while working at the Salk Institute, my colleagues and I undertook a comparative analysis of all the available sequences of protein kinase catalytic domains. This interest stemmed from my having identified several novel human protein kinases using a homology-based cDNA cloning strategy and wanting to determine their relationships to other known protein kinases. In collaboration with the Salk's resident protein kinase guru Tony Hunter and biocomputing specialist Anne Marie Quinn, we aligned the homologous catalytic-domain amino-acid sequences of 65 distinct protein kinases from diverse eukaryotes (including 45 nonorthologous vertebrate enzymes) and constructed a phylogenetic tree to visualize their overall relationships. The alignment (produced manually at the word-processor) defined the boundaries of the eukaryotic protein kinase (ePK) catalytic domain, revealed conserved subdomains that were never interrupted by amino-acid insertions, and identified highly conserved individual amino acids and motifs (Figure 1).
The phylogenetic tree revealed major clusters including the tyrosine kinases (the TK group), cyclic nucleotide- and calcium-phospholipid-dependent kinases (the AGC group; including the PKA, PKG, and PKC families) and calmodulin-dependent kinases (the CAMK group). These groupings indicated that ePK domain phylogeny reflects substrate specificity and/or mode of regulation and could therefore serve as a useful classification tool. Over the next 7 years I continued to add new sequences to the alignment as they became available and to construct phylogenetic trees as a means of classifying the burgeoning ePK superfamily. By early 1994, the ePK domain alignment had grown to contain 390 sequences including 205 non-orthologous vertebrate ePKs, and a fourth major ePK group (CMGC, comprising the CDK, MAPK, GSK, and CLK families) had been added through phylogenetic analysis. The 390 ePK domain alignment was made publicly available through the Protein Kinase Resource website.
The genomic era
By 1995, with the advent of genome-sequencing projects, the task of cataloging and classifying the members of the ePK superfamily had grown to become too distracting from my funded research and I discontinued my efforts in this area. Tony Hunter continued to work with bioinformaticians at SUGEN, Inc. (including Greg Plowman, Gerard Manning, and Sucha Sudarsanam) to characterize the full ePK complements of model eukaryotes from genomic sequence data [5, 6]. By the time of a recent report, their efforts had resulted in the identification and classification of 115 distinct ePKs from budding yeast (around 2% of all genes), 434 from Caenorhabditis elegans (about 2.5% of all genes), and 223 from Drosophila. In addition they described the complement of 'atypical protein kinases' (aPKs) from these species: 15 from yeast, 20 from C. elegans, and 16 from Drosophila. (The aPKs are a variety of proteins that lack strong sequence similarity to the classical ePK domain but have been shown experimentally to have protein kinase activity; well-known examples are the 'lipid kinases' of the phosphatidylinositol 3'-kinase (PI3K) family, some of which also phosphorylate proteins.)
As a result of their comprehensive analyses of 'kinomes', the SUGEN investigators were able to define three new major groups within the broad ePK classification scheme: first, the STE group, which includes ePKs that function in the MAPK kinase cascades that were first described through characterization of yeast sterile mutants; second, the CK1 group, including the casein kinase 1 family and related enzymes, which is greatly expanded in the worm; and third, the TKL ('tyrosine-kinase like') group that includes the STKR family of TGFbeta serine/threonine kinase receptors and is phylogenetically close to the tyrosine kinases (TKs). Many distinct kinase families within the AGC, CAMK, CMGC, STE, and CK1 groups have representatives from all three species, supporting the idea of an early evolutionary origin and critical function in basic cellular processes. Members of the TK and TKL groups are notably absent from yeast, consistent with the known functions of these ePKs in intercellular signaling events associated with metazoan complexity. More discussion of the evolutionary relationships among the ePKs identified through the SUGEN genome-mining efforts has been published elsewhere. The SUGEN kinase.com website includes links to all their published work on protein kinase analysis as well as 'KinBase', a very useful searchable database that holds information on all the protein kinase genes found in the yeast, worm, fly, and human (see below) genomes.
Human protein kinases
The completion of the first draft of the human genome sequence presented an opportunity to determine the full complement of human protein kinases. The first analysis came from a group led by Mitch Kostich at Schering-Plough Research Institute (SPRI). This group mined public GenBank records (available before December, 2001) for ePK sequences by performing BLAST searches using known ePK domains as queries. The resulting hits were consolidated, and efforts were made to remove non-human sequences, pseudogenes, and poor-quality sequences that could represent duplicate hits. The SPRI investigators chose to err on the side of inclusion rather than exclusion, however, and many cases of 'single hit' sequences were retained. Their effort culminated in a collection of 510 potentially unique human ePKs. A color-coded alignment that accompanied their article nicely illustrates the ePK domain sequence conservation.
The SUGEN group, led by Gerard Manning and Sucha Sudarsanam, carried out a more comprehensive effort to describe and classify all human ePKs. In addition to the public databases, their dataset included genomic reads from Celera that are not publicly available and non-public expressed sequence tags (ESTs) from Incyte and SUGEN. They searched these data using a hidden Markov model of the ePK domain that allowed detection of very divergent family members, and further searched for members of the various known aPK families. Using stringent criteria to eliminate false positives (including verification of novel sequences by cDNA cloning) they compiled a list of 478 human members of the ePK superfamily and another 40 aPKs, bringing their human kinome total to 518 (approximately 1.7% of all predicted human genes). They also identified 106 ePK or aPK pseudogenes.
A comparison of the SPRI-510 and SUGEN-518 lists reveals 474 protein kinases in common (see the additional file). Of the 44 SUGEN-specific kinases, 32 are aPKs; the other 8 aPKs identified by SUGEN, from the ABC1 and RIO families, were included in the SPRI list as a result of their having weak ePK domain similarity. Of the remaining 12 SUGEN-specific ePKs, five (TAK1, MLKL, NEK5, SgK307, and TBCK) were not available in the public data used in the SPRI analysis; another five (SgK196, SgK223, SgK424, SgK493, and Slob) have rather divergent ePK domains that lack many of the highly conserved residues and are unlikely to have catalytic activity, so it is easy to see how these might have been excluded by visual inspection; and the final two are SgK110 and NEK10. SgK110 was actually detected by the SPRI search, but it was erroneously merged with a related sequence AC008735_EPK1 (SgK069) on the same genomic contig; and it is unclear why the SPRI group missed NEK10. Most, if not all, of the 36 SPRI-specific ePKs represent over-inclusion errors (Table 1): 14 correspond to sequences determined to be pseudogenes by the SUGEN group; 19 are based on single sequences that are (or appear to be) either poor-quality duplicates of other ePKs or interspecies contaminants; and the remaining three are duplicates arising by virtue of non-overlapping partial sequences.
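The bookkeeping in the comparison above can be checked with a few lines of arithmetic. The following is only a sketch: it uses the counts quoted in the text, not the lists themselves, and the category labels and the `reconcile` helper are hypothetical names introduced for illustration.

```python
# Counts quoted in the text; nothing here is derived from the actual lists.
SPRI_TOTAL = 510   # SPRI candidate ePKs
SUGEN_EPK = 478    # SUGEN ePK superfamily members
SUGEN_APK = 40     # SUGEN atypical protein kinases
COMMON = 474       # entries shared by the two lists

# Breakdown of list-specific entries as described in the text
# (the labels are illustrative, not taken from either publication).
SUGEN_SPECIFIC = {
    "aPKs absent from the SPRI list": 32,
    "ePKs not in the public data used by SPRI": 5,
    "ePKs with divergent domains": 5,
    "SgK110 and NEK10": 2,
}
SPRI_SPECIFIC = {
    "pseudogenes according to SUGEN": 14,
    "single-sequence duplicates or contaminants": 19,
    "duplicates from non-overlapping partial sequences": 3,
}


def reconcile():
    """Derive the list-specific totals and check them against the breakdowns."""
    sugen_total = SUGEN_EPK + SUGEN_APK
    sugen_only = sugen_total - COMMON
    spri_only = SPRI_TOTAL - COMMON
    # Each breakdown should account for every list-specific entry.
    assert sugen_only == sum(SUGEN_SPECIFIC.values())
    assert spri_only == sum(SPRI_SPECIFIC.values())
    return {
        "sugen_total": sugen_total,
        "sugen_only": sugen_only,
        "spri_only": spri_only,
    }


if __name__ == "__main__":
    print(reconcile())
```

With the real gene lists in hand, the same check would more naturally be written as set intersection and difference over gene identifiers, once the naming differences between the two compilations had been resolved.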
Thus the SUGEN compilation of 478 human ePK superfamily genes represents the most accurate count based on current sequence data. If one subtracts those that lack key conserved residues, we are left with 428 human ePKs with known or likely kinase function (Table 2), 99% of which were included in the SPRI list; 365 of these fall within the seven major ePK groups: TK, 84 in total; CAMK, 66; AGC, 61; CMGC, 61; STE, 45; TKL, 37; and CK1, 11. The remaining 63 are in the 'Other' category, falling outside the main ePK group branches. Krupa and Srinivasan have also recently searched the public human genome data with a focus on identifying functional protein kinases; their efforts resulted in a list of 448 distinct human ePK sequences, but around 90 of these appear to represent duplicate entries, and no novel protein kinases were identified that were not present in the SUGEN compilation.
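As a quick consistency check on the group tallies quoted above, the per-group counts can be summed and expressed as shares of the 428 likely functional ePKs. This is a minimal sketch using only the numbers in the text; the `summarize` helper and the one-decimal rounding are assumptions made for illustration.

```python
# Per-group counts of likely functional human ePKs, as quoted in the text.
GROUP_COUNTS = {
    "TK": 84,
    "CAMK": 66,
    "AGC": 61,
    "CMGC": 61,
    "STE": 45,
    "TKL": 37,
    "CK1": 11,
}
OTHER = 63  # ePKs falling outside the seven major group branches


def summarize(groups, other):
    """Return (count in major groups, overall total, percent share per group)."""
    in_groups = sum(groups.values())
    total = in_groups + other
    shares = {name: round(100 * n / total, 1) for name, n in groups.items()}
    return in_groups, total, shares


if __name__ == "__main__":
    in_groups, total, shares = summarize(GROUP_COUNTS, OTHER)
    print(f"major groups: {in_groups}, total: {total}")
    # List groups from largest to smallest share.
    for name, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{name:5s} {pct:5.1f}%")
```

The sums reproduce the 365 and 428 figures given in the text; the percentages are derived values and do not appear in the original compilations.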
Usefulness of the kinome data
Knowing the full complement of ePK family members and functional ePKs encoded by eukaryotic genomes will have great impact upon many areas of scientific investigation. As mentioned above, an obvious benefit relates to understanding of how signal transduction pathways evolved during the course of eukaryotic evolution. Both SUGEN and Krupa and Srinivasan extended their analyses to describe other domains present in the various human ePKs which are likely to function in directing the enzymes to relevant substrates or modulating kinase activities. Further analysis of the ePK domain sequences uniquely conserved within the major groups and families, together with comparisons of ePK domain crystal structures, should ultimately allow a full understanding of how different classes of peptide substrate are recognized. For example, Figure 2 shows consensus sequences for the catalytic loop region in subdomain VIB (which includes the invariant aspartate thought to function as the catalytic base) and the activation loop region in subdomain VIII (which includes the highly conserved glutamate in the 'APE' motif) - two regions that have been recognized as being primarily involved in peptide-substrate recognition [12, 13]. A number of group-specific differences are apparent (highlighted in Figure 2) that correlate with unique peptide-recognition tendencies for the ePKs that fall within a given group. Beyond sequence analysis, the kinome data will allow for the development of comprehensive tools (such as full-length cDNAs, microarrays, antibodies, and fusion protein and RNAi constructs) that will greatly aid laboratory investigations aimed at understanding cell signaling through analysis of kinase function. As an example of such proteomic approaches to the study of protein kinases, nearly all yeast protein kinases have been expressed and analyzed for their ability to phosphorylate an array of protein or peptide substrates using protein-chip technology.
Finally, the human kinome data will have benefits in the understanding and treatment of human diseases. The ePK genes that map within disease loci are attractive etiological candidates, and knowledge of the full repertoire of human protein kinases will greatly aid in the development of drugs that target specific protein kinases or protein kinase families whose function contributes to disease-associated cellular defects.
Additional data file
An additional data file with the 474 protein kinases in common between the SPRI-510 and the SUGEN-518 lists is available.
References
Hanks SK, Quinn AM, Hunter T: The protein kinase family: conserved features and deduced phylogeny of the catalytic domains. Science. 1988, 241: 42-52.
Hanks SK: Homology probing: identification of cDNA clones encoding members of the protein-serine kinase family. Proc Natl Acad Sci USA. 1987, 84: 388-392.
Hanks SK, Hunter T: The eukaryotic protein kinase superfamily: kinase (catalytic) domain structure and classification. FASEB J. 1995, 9: 576-596.
Protein Kinase Resource. [http://kinases.sdsc.edu/html/index.shtml]
Hunter T, Plowman GD: The protein kinases of budding yeast: six score and more. Trends Biochem Sci. 1997, 22: 18-22. 10.1016/S0968-0004(96)10068-2.
Plowman GD, Sudarsanam S, Bingham J, Whyte D, Hunter T: The protein kinases of Caenorhabditis elegans: a model for signal transduction in multicellular organisms. Proc Natl Acad Sci USA. 1999, 96: 13603-13610. 10.1073/pnas.96.24.13603.
Manning G, Plowman GD, Hunter T, Sudarsanam S: Evolution of protein kinase signaling from yeast to man. Trends Biochem Sci. 2002, 27: 514-520. 10.1016/S0968-0004(02)02179-5.
Kostich M, English J, Madison V, Gheyas F, Wang L, Qiu P, Greene J, Laz TM: Human members of the eukaryotic protein kinase family. Genome Biol. 2002, 3: research0043.1-12. 10.1186/gb-2002-3-9-research0043.
Manning G, Whyte DB, Martinez R, Hunter T, Sudarsanam S: The protein kinase complement of the human genome. Science. 2002, 298: 1912-1934. 10.1126/science.1075762.
Krupa A, Srinivasan N: The repertoire of protein kinases encoded in the draft version of the human genome: atypical variations and uncommon domain combinations. Genome Biol. 2002, 3: research0066.1-0066.14. 10.1186/gb-2002-3-12-research0066.
Taylor SS, Radzio-Andzelm E, Hunter T: How do protein kinases discriminate between serine/threonine and tyrosine? Structural insights from the insulin receptor protein-tyrosine kinase. FASEB J. 1995, 9: 1255-1266.
Johnson LN, Lowe ED, Noble MEM, Owen DJ: The structural basis for substrate recognition and control by protein kinases. FEBS Lett. 1998, 430: 1-11. 10.1016/S0014-5793(98)00606-1.
Kreegipuu A, Blom N, Brunak S, Järv J: Statistical analysis of protein kinase specificity determinants. FEBS Lett. 1998, 430: 45-50. 10.1016/S0014-5793(98)00503-1.
Zhu H, Klemic JF, Chang S, Bertone P, Casamayor A, Klemic KG, Smith D, Gerstein M, Reed MA, Snyder M: Analysis of yeast protein kinases using protein chips. Nat Genet. 2000, 26: 283-289. 10.1038/81576.
Janji B, Melchior C, Vallar L, Kieffer NL: Cloning of an isoform of integrin-linked kinase (ILK) that is upregulated in HT-144 melanoma cells following TGF-beta1 stimulation. Oncogene. 2000, 19: 3069-3077. 10.1038/sj.onc.1203640.
The author has declared that he has no affiliation with SUGEN or Schering-Plough.
I am indebted to Gerard Manning of SUGEN, Mitch Kostich of Schering-Plough Research Institute, and N. Srinivasan of the Indian Institute of Science, for their contributions and comments regarding comparative analysis of their respective human protein kinase compilations.