The repertoire of protein kinases encoded in the draft version of the human genome: atypical variations and uncommon domain combinations
© Krupa and Srinivasan, licensee BioMed Central Ltd 2002
Received: 1 July 2002
Accepted: 11 October 2002
Published: 13 November 2002
Phosphorylation by protein kinases is central to cellular signal transduction. Abnormal functioning of kinases has been implicated in developmental disorders and malignancies. Their activity is regulated by second messengers and by the binding of associated domains, which are also influential in translocating the catalytic component to their substrate sites, in mediating interaction with other proteins and carrying out their biological roles.
Using sensitive profile-search methods and manual analysis, the human genome has been surveyed for protein kinases. A set of 448 sequences, which show significant similarity to protein kinases and contain the critical residues essential for kinase function, have been selected for an analysis of domain combinations after classifying the kinase domains into subfamilies. The unusual domain combinations in particular kinases suggest their involvement in ubiquitination pathways and alternative modes of regulation for mitogen-activated protein kinase kinases (MAPKKs) and cyclin-dependent kinase (CDK)-like kinases. Previously unexplored kinases have been implicated in osteoblast differentiation and embryonic development on the basis of homology with kinases of known functions from other organisms. Kinases potentially unique to vertebrates are involved in highly evolved processes such as apoptosis, protein translation and tyrosine kinase signaling. In addition to coevolution with the kinase domain, duplication and recruitment of non-catalytic domains is apparent in signaling domains such as the PH, DAG-PE, SH2 and SH3 domains.
Expansion of the functional repertoire and possible existence of alternative modes of regulation of certain kinases is suggested by their uncommon domain combinations. Experimental verification of the predicted implications of these kinases could enhance our understanding of their biological roles.
Many cellular stimuli are channeled through a number of signaling molecules that result in alterations in the transcriptional and translational status of gene(s) to bring about the desired response. Reversible protein phosphorylation is one of the key mechanisms commonly used in signal transduction to alter the functional states of the signaling proteins. Covalent attachment of a phosphate group to target proteins by protein kinases modulates the functional status of a myriad of proteins involved in cellular signaling networks. Protein kinases are essential to the regulation of diverse cellular processes including metabolism, stress responses, transcription, translation, DNA replication and cell-cycle control [1,2]. In the multicellular eukaryotes they are also involved in more evolved functions such as organ and limb development, neuronal signaling, apoptosis and cell-cell communication [3,4]. Abnormal functioning of protein kinases has been implicated in a large number of human diseases, and many kinases have been identified as proto-oncogenes [5,6].
Serine/threonine kinases and tyrosine kinases make up one of the largest known protein families. Commonly referred to as the eukaryotic protein kinases (ePKs), these share a common three-dimensional fold. The catalytic domain comprises an amino-terminal lobe containing mostly β strands and a carboxy-terminal lobe containing mainly α helices. The ATP-binding and substrate-binding sites are located in and around the cleft between the two lobes. The ePKs have been further classified into various subfamilies on the basis of the sequence similarity of the catalytic kinase domains [1,8,9]. Members of a subfamily also often have similar substrate specificities and modes of regulation at a general level. ePKs are regulated by diverse means that include binding of an additional subunit, autophosphorylation and interaction with other domains within the polypeptide chain.
Using the first drafts of the human genome data, we report here on the repertoire of protein kinases in the human genome and their domain structures. We have classified the human kinases following Hanks and Hunter's [1,8,9] classification scheme and present them online at our kinases in genomes (KinG) website and as Additional data files. This list of kinases has been carefully arrived at after eliminating sequences that lack critical functional residues, in particular the ATP-binding glycine-rich motif and the catalytic aspartate. These residues are indicated in the multiple sequence alignments of catalytic domains given in the Additional data files. Fragments of protein kinases of fewer than 200 residues and lacking the functional residues have not been considered in the analysis. These include 24 gene products in the list with fewer than 200 residues that are probably pseudogenes. Hence, all the kinases used in the analysis are likely to be active kinase gene products and are unlikely to be pseudogenes. This paper discusses in particular those members of the kinase family with unusual domain combinations and those variants not studied so far, to the best of our knowledge, by experiment. We have compared the human kinase gene products with those encoded in the genomes of fruit fly, worm and yeast. A search in the human genome has also been carried out to identify the homologs of various subfamilies of bacterial protein kinases.
Results and discussion
From our survey of the human genome using various sensitive family-profile search methods, the data available in public databases, the presence of functional sequence motifs and manual analysis (see Materials and methods), we chose a dataset of 448 sequences for analysis of their domain structures and analysis of sequences not studied experimentally so far. We originally found 556 sequences of sufficient length and showing statistically significant E-values in the alignment with bona fide protein kinases. Although these sequences are expected to adopt the fold of a protein kinase, 108 of them lacked the glycine-rich motif or the catalytic base aspartate, or both, and hence do not have the essential requirements of a functional protein kinase (see Materials and methods for details). We have not considered these 108 sequences in the current analysis and confined it to 448 clear-cut cases. In addition, 11 lipid kinases have also been identified; these belong to the class of phosphoinositide-3 and phosphoinositide-4 kinases having structural similarity to the catalytic domain of protein kinases. The list of all the putative protein kinases used in this analysis, their classification into groups, multiple sequence alignment with indications of critical functional residues and the domain structures are given at  and in the Additional data files. The completed genomes of Drosophila melanogaster, Caenorhabditis elegans and Saccharomyces cerevisiae have also been surveyed and we used datasets containing 257, 474 and 122 gene products, respectively, containing regions with significant sequence similarity to the catalytic kinase domain. The number of protein kinases in the genomes analyzed here varies slightly from those reported by other groups [11,13,14,15,17] and in the public databases. This disparity is likely to be due to the sensitivity of the search methods and frequent updating of the draft version of the human genome. 
In the current analysis, the protein kinase homologs have been detected using a variety of sensitive methods and also by manual inspection to exclude most of the kinase-like sequences lacking functionally critical residues and those gene products that are very short. Our present study is focused only on the clear cases that are unlikely to be altered as a result of further changes in the dataset. In particular, we have used stringent criteria in our search procedures (see Materials and methods) to avoid false positives. We have also consulted the datasets of kinases extracted from websites such as Superfamily [18,19], InterPro and http://kinase.com, as well as ensuring the presence of functional sequence motifs, in order to decide on the list of kinases presented here.
Most human protein kinases have been classified into major subfamilies proposed by Hanks and Hunter [1,8,9], as described in Materials and methods. The most highly populated subfamilies are CAMK, receptor/non-receptor tyrosine kinases, CMGC and AGC (see Materials and methods for abbreviations) containing 81, 94, 72 and 69 proteins respectively. The casein kinase subfamily has only 12 members in the human genome. The protein kinases not included in the above subfamilies belong to other subfamilies, which include the polo family, MEK, MEKK, PAK, NimA, mixed-lineage kinase (MLK), activin-TGFβ receptors, Wee1-Mik1, Raf and kinases involved in translational control. A detailed analysis of the members of the classical kinase subfamilies will be published elsewhere. Recent studies have identified four novel subfamilies of eukaryotic protein kinase-like sequences in bacteria  and homologous members of these subfamilies have been identified in the human genome. They include three homologs from the RIO1 subfamily, and eight and four members from the PID261 and ABC1 subfamilies, respectively. No homologs have been identified for the AQ578 subfamily of the bacterial protein kinases.
Protein kinases with uncommon combination of domains
The high degree of functional diversification among the protein kinases is made possible by their ability to interact with large numbers of cellular proteins. These interactions are mediated through additional subunits or domains of the kinase that are regulatory or act as protein-interaction modules. The functions of these non-catalytic domains thus suggest the biological roles of the kinase and the specificity of the proteins or other ligands that bind to their specific domains. We identified 89 single-domain kinases encoded in the human genome, which are probably regulated by separate subunits. Clearly, a majority (359 sequences - about 80%) of the human protein kinases contain at least one domain other than the catalytic kinase domain.
A protein kinase (ENSP232797) with two HEAT and six WD repeats following the catalytic domain has been identified (Figure 1d). The HEAT motifs are known to form α-helical repeats, resulting in suprahelical structures that serve as protein-recognition interfaces  as understood for the PR65 subunit of phosphatase 2A . The domain structure and the sequence pattern of the catalytic domain of this protein kinase are distinct from those of known kinase subgroups, suggesting a distinct subgroup.
The protein kinases involved in the regulation of protein translation are activated in various ways. In addition to the well-known members of this class, namely eIF2α kinase (ENSP242081)  and protein kinase-R (PKR) (ENSP233057; Figure 1g) , the current study revealed the presence of a protein kinase with a single-strand RNA-binding motif (RRM) (Figure 1e). This kinase (ENSP235784) is known to be associated with the stathmin gene implicated in leukemia . The presence of RRM suggests that RNA binding could be influential for its activity and suggests a probable role in the phosphorylation of proteins controlling translation or RNA-associated proteins.
A DMPK (myotonic dystrophy kinase)-like protein kinase (ENSP216542; Figure 1c)  has a phorbol-ester binding (DAG_PE) domain followed by a pleckstrin homology (PH) domain and a CNH domain. The catalytic domain has high similarity to the kinases of the AGC group. The presence of the PH domain may suggest an association with the membrane. The PH domain could also mediate interaction with small G proteins, as DMPK is known to interact with Cdc42, a GTPase proximal to the membrane . Hence, the domain combination of this kinase is consistent with a regulatory mechanism mediated by the Cdc42 GTPase and its action downstream in the small G-protein-mediated pathways.
Another kinase encoded in the human genome (gi13652062) shows similarity to the mitogen-activated protein kinase kinase (MAPKK) catalytic domain and contains an octicosapeptide repeat (OPR) domain which is known to occur in an isoform of protein kinase C (PKC) . This divalent cation-binding domain suggests an influence of divalent cations in mitogenic signaling.
The Trio kinase (triple-function domain) (ENSP252004)  belonging to the CAMK group is unique in having two guanine-nucleotide-exchange factor (GEF) domains, one specific for Rho and the other for Rac (Figure 1k). As the amino-terminal GEF is involved in the activation of Jun kinases and in the production of membrane ruffles , it appears that the kinases upstream to Jun kinases could be the probable targets of the Trio kinase. Another member of the CAMK group (gi13651132; not shown in Figure 1) contains SH3, PDZ and guanylate kinase domains. The PDZ domain may mediate the membrane-association property of this kinase. Whereas the SH3 domain might mediate interprotein or interdomain interactions, it is intriguing to note the existence of a potential catalytic guanylate kinase domain alongside a protein kinase domain.
A protein kinase of the CMGC group (ENSP234972) with eight ankyrin repeats at the amino terminus has the characteristic sequence of the C-helix of cyclin-dependent kinases (CDKs) conserved in the catalytic domain (Figure 1j). Interaction of the CDK inhibitors belonging to the p19INK family occurs through the ankyrin repeats of the inhibitor , and the inhibitors exist as separate genes (not fused to the catalytic kinase domain as in ENSP234972). Hence, the role of ankyrin domains in this member of the CDK-like kinase subfamily of the CMGC group could be intramolecular regulation.
Also identified was a CDK-7-related kinase (ENSP234626) with an insertion of 173 residues in the activation loop that continues as subdomain X, followed by another insertion of 113 residues that extends as the carboxy-terminal end of the catalytic domain. Serine/threonine kinase-23 of the CMGC group has an insertion of 134 residues between the catalytic loop and the activation loop, and a second insertion, 86 residues long, following subdomain X. An unannotated kinase gene product corresponding to the TREMBL entry tr |CAC39299| has an insertion of 144 residues between the catalytic loop and the activation loop. The dual-phosphorylation-regulated kinase homolog (ENSP247840) also has an insertion between subdomain X and the carboxy-terminal end of the catalytic domain. These insertions could constitute a separate domain and probably confer different substrate specificity on the kinases, as they are located in the carboxy-terminal lobe.
Hypothetical protein kinases not studied experimentally so far
Two protein kinases (ENSP207880 and ENSP188359) (Figure 2a,g) contain repeat motifs of the ARM repeat family. The catalytic domain of one of them (Figure 2g) is similar to those of the AGC subfamily, whereas the other kinase does not seem to belong to any of the known subfamilies. The ARM repeat is evolutionarily related to the HEAT repeat, and both usually act as scaffold modules enabling interaction with other proteins.
A protein kinase (ENSP251197; Figure 2j) with a PDZ domain and regions homologous to mouse kinases (MH1 and MH2) is currently known only from the human and mouse genomes. The kinase domain of this gene product has a sequence identity of 71% to the microtubule-associated testis-specific serine/threonine kinase (MAST-205) . MAST-205 is known to be associated with microtubules that are part of contractile structures. The regions at the amino and carboxyl termini of the kinase are homologous with the regions flanking the kinase domain of the syntrophin-associated protein kinase implicated in muscular dystrophy . Translocation of the proteins constituting the dystrophy complex is aided by the PDZ domains of syntrophins that interact with the syntrophin-associated kinase . From the domain combination observed in ENSP251197, a role for this protein kinase in the phosphoregulation of protein components of the dystrophy complex cannot be precluded. Assignment of biological function by virtue of similar domain combinations is consistent with previous studies  where conservation of domain order in proteins of similar domain combinations is suggested as an important means of functional conservation among various multidomain proteins. However, definite implication of ENSP251197 in the regulation of proteins of the dystrophy complex will require experimental studies.
Also included in this list is the protein kinase ENSP250887 (Figure 2i), which is closely related to HrPOPK1 of Halocynthia roretzi and Lok1p protein kinase of Drosophila. Both these are implicated in the establishment of the embryonic axis [40,41], suggesting a probable involvement of this human kinase in embryonic development.
A tyrosine kinase (ENSP251707, Figure 2k) with a predicted transmembrane region and similarity to the mouse apoptosis-associated tyrosine kinase  is identified in the human genome, suggesting a probable role of this human tyrosine kinase in apoptosis. If a search of the sequence databases is made, a large number of different tyrosine kinases appear as hits, following the mouse apoptosis-associated tyrosine kinase as the top hit.
A homolog of the mouse BMP2-inducible protein kinase (ENSP251485, Figure 2h) has also been identified in the human genome. The BMP2-inducible kinase has a regulatory role in osteoblast differentiation , suggesting a similar role for the human kinase.
Non-kinase domains associated with human protein kinases
There are 61 distinct non-kinase domains associated with the catalytic domains of the human protein kinases. A few domains are shared by more than one protein kinase whose catalytic domains belong either to the same subfamily (such as the AGC group) or to different subfamilies. In the current study we have also investigated the extent of similarity in the non-kinase domains shared by protein kinases with catalytic domains from different subfamilies. The domains analyzed include DAG_PE, SH2, SH3 and PH.
The phylogenetic analysis of protein kinases and their associated domains was carried out as described in Materials and methods. Phylogenetic trees were obtained using CLUSTAL W . The SH3 and SH2 domains are usually involved in the recruitment of proteins to their specific targets. In the case of protein tyrosine kinases, however, they also have regulatory roles. The SH2 domains of the non-receptor tyrosine kinases are known to translocate the corresponding kinases to specific receptors or other proteins associated with receptors, which activate them. So the specificity of these tyrosine kinases is also dependent on the adaptor domains that target them to specific proteins.
In the protein kinases, diacylglycerol-binding (DAG_PE) domains are restricted to the AGC and Raf subfamilies. The PKC isoforms β, γ, θ, η, ε and μ have two DAG_PE domains in tandem and amino-terminal to the kinase catalytic domain. The amino-terminal DAG_PE domains of β, γ, θ, η and ε isoforms form a distinct group with sequence identity greater than 68%, whereas the carboxy-terminal DAG_PE domains of the β, ε, θ, η and γ cluster together (65%) (dendrogram not shown). The PKCμ isoform is an exception where both the amino- and carboxy-terminal DAG_PE domains cluster together (sequence identity 54%) with the carboxy-terminal DAG_PE clusters.
The occurrence of significantly diverse homologous domains in the same protein kinase, as evident from their distinct clustering, could also suggest the involvement of such kinases in the cross-talk between different signaling pathways. Such domain diversity could imply an involvement of such kinases in more than one pathway, serving as upstream and downstream effectors for more than one signaling cascade.
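The sequence identities quoted above for the DAG_PE domain clusters are pairwise comparisons of aligned domains. As a minimal sketch of how such a figure can be computed, the Python function below takes two already-aligned sequences of equal length and reports the percentage of identical residues. The function name and the choice to ignore columns containing a gap are assumptions made for illustration; the source does not state which definition of percent identity was used, and the strings in the usage note are toy sequences, not real DAG_PE domains.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two aligned sequences of equal length.

    Assumption: columns in which either sequence has a gap are skipped,
    so the denominator is the number of ungapped aligned columns.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns where both sequences carry a residue.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)
```

For example, `percent_identity("ACDE", "ACDF")` compares four columns, three of which match. Other conventions (for instance, dividing by the length of the shorter sequence) give different values for gapped alignments, so identities are only comparable when the same definition is used throughout.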
Human protein kinases with no detectable closely related proteins encoded in the fruit fly and worm genome databases
In view of the gene duplications across various genomes, certain subfamilies of the protein kinases may have expanded in one genome compared with another. This section discusses only the extreme cases, in which no homolog with significant similarity in both sequence and domain composition to a protein kinase encoded in the human genome could be identified in the genomes of two lower multicellular eukaryotes with complete genome data. Such an analysis of the human protein kinases has enabled us to identify protein kinases that are potentially specific to vertebrates. We have compared the human protein kinases with those of two other multicellular organisms, D. melanogaster and C. elegans.
As protein tyrosine kinases are involved in sophisticated functions such as cellular differentiation and proliferation, one would expect that the complexity of signaling in higher organisms could be attributed partly to them. The apoptosis-associated tyrosine kinase (AAYK), Fl-cytokine receptor kinase, c-Fms and mast/stem-cell growth factor receptor kinase identified in the human genome could not be detected in the worm and fruit fly genomes. The closest homologs of AAYK are neurospecific receptor kinase (NRK) with Fz and kringle domains in fly (30% identity with the kinase domain) and an open reading frame (ORF) CE00747 with laminin, epidermal growth factor (EGF) and FN3 domains in worm (51% identity with the kinase domain). However, no distinct functional domains other than the catalytic kinase domain could be identified in AAYK. The Fl-cytokine receptor also seems to lack a functional counterpart in the fly and worm genomes, although its closest homologs are the platelet-derived growth factor (PDGF) receptor and vascular endothelial growth factor (VEGF) receptor-related kinase (PVR) and the fibroblast growth factor (FGF) receptor family member ORF CE28238 in the fly and worm genomes, respectively. However, the domain organization of PVR and CE28238 is distinct from that found in the Fl-cytokine receptors. A functional homolog of c-Fms and of the mast/stem-cell growth factor receptor kinase is apparently absent from the fly and worm genomes. The closest sequence neighbors of c-Fms and the mast/stem-cell growth factor receptor kinase are the FGF receptor kinase and PVR, respectively, in the fly, whose domain structures are different from those of c-Fms and the mast/stem-cell growth factor receptor kinase.
All the above-mentioned kinases are involved in highly specialized cytokine- and growth-factor-mediated signaling pathways, and a few are specific for hematopoietic cells (mast/stem-cell receptor kinase) and immune responses (macrophage-stimulated receptor kinase), which explains their absence from lower multicellular organisms.
Certain members of a group of kinases involved in protein translation, such as heme-regulated initiation (HRI) factor kinase (gi11125768) and interferon-induced dsRNA-dependent kinase (PKR) (ENSP233057), are at present found only in the human genome among the completed and publicly available genomes. This is consistent with the interferon-induced signaling pathways and the heme-regulated pathways being specific for vertebrates, and hence with the lack of their effector domains in lower organisms. The domain structures of the homologs of the HRI kinase and PKR in worm and fly, the eukaryotic initiation factor eIF2α kinases, are different from those found in higher organisms. The worm and fly homologs have two repeats of bacterial PQQ domains preceding the catalytic kinase domain in the sequence. In the case of human HRI kinase, no functional domain can be assigned to this region, although it has been suggested that heme binds to this region and influences kinase activity. The human dsRNA-dependent kinase has two consecutive dsRNA-binding motifs preceding the catalytic kinase domain in the sequence. This is also consistent with its activity being influenced by interferon, an antiviral agent. Binding of the viral RNA to the dsRNA-binding motifs of the kinase could be regulated by interferon, and subsequently trigger a signaling cascade designed to evade viral infection.
The possible absence of such kinases in lower organisms such as fly and worm implies the existence of an alternative mode of defense against viral infection. Hence, the occurrence of heme-binding as well as RNA-binding kinases in higher organisms suggests that the translation machinery of these organisms is associated in more than one way with the protein kinases, linking it to the various signaling cascades.
Another important class of kinases, the receptor-interacting protein kinases (RIPKs) (ENSP220751), which are associated with receptors triggering apoptosis, was identified in the human genome but could not be found in invertebrates. The closest homologs of the RIPKs in the worm genome are the Raf kinases, which lack the caspase-recruitment domain (CARD) found in RIPKs. MAPKKK (TAK1) in the fly genome is the closest homolog of the RIPKs, and also lacks a CARD domain. This suggests that the components involved in controlling the signaling pathways mediating cell death in humans could be different from those in lower organisms. RIPKs with a CARD domain in higher organisms may have evolved to provide additional regulation of the apoptotic signaling pathways, and could also act as a link between their substrates and other components of the apoptotic machinery.
These potentially unique kinases may therefore be responsible for fine-tuning the most highly specialized signaling pathways in humans. They may hence lack counterparts in the lower organisms, or their functions may be carried out there by other subgroups of enzymes. It will be of interest to search for these kinases in other higher eukaryotic organisms such as mouse as the data become freely available.
The functional diversification of the kinases in the human genome, to meet the requirements of complex network of signaling pathways, is quite apparent from the analysis. The functional repertoire seems to be further expanded by protein kinases of uncommon domain composition. Shuffling of signaling domains or modules among gene products containing the various catalytic kinase domains seems to be one of the obvious means of generating such diverse biological roles. Identification of protein kinases with uncommon domain combinations suggests the possible existence of alternative modes of regulation of a few protein kinases, which appear to be distinct from the previously known and well characterized regulatory strategies.
The functional promiscuity of the various modules involved in signal transduction enables diverse biological interactions to be achieved using a limited number of modules. The occurrence of combinations of these modules would narrow the range of functional diversity, however. Given the functional diversity associated with these modules, their co-occurrence with domains of distinct known function, investigated using a computational approach in the current study, enables us to predict the biological implications of such uncommon domain combinations in relation to the overall functions of protein kinases. Experimental verification of these predictions could enhance our understanding of the specific biological roles of these protein kinases.
Materials and methods
The complete set of predicted protein sequences from the open reading frames (ORFs) of the human genome (build 24; 3 July 2001) was obtained from the National Center for Biotechnology Information (NCBI)  and from Ensembl (version 1.1.0; April 2001 ) databases. Using sensitive sequence-profile matching algorithms, we have searched for kinases in both these versions of the human genome as there are differences between them. An overwhelming majority of the recognized kinases are common to the NCBI and Ensembl versions, but we have identified distinct kinases by comparing the two lists. Genome data from S. cerevisiae and D. melanogaster have been obtained from  and C. elegans from the Sanger Centre, UK .
We have used multiple sensitive sequence search and analysis methods: PSI-BLAST, IMPALA and HMMer, which matches hidden Markov models (HMMs). These programs have been previously benchmarked ([55,57,58] and N. Mhatre and N.S., unpublished work) and we have used a stringent cutoff for E-values (0.0005 in PSI-BLAST, 10⁻⁸ in IMPALA and 0.1 for HMMer) for identifying sequences with significant similarity to protein kinases. A six-node multiprocessor Linux cluster machine (built by CDC Linux, Inc.) and several Linux-driven PCs have been used in these searches, using stand-alone versions of the programs. The list of predicted human kinases used for further analysis has been arrived at after careful cross-referencing between the results of these methods as well as manual scrutiny for a variety of factors such as length of the kinase domains and presence of critical functional residues. Protein kinases from the human genome and other organisms extracted from databases such as SUPERFAMILY [18,19], InterPro and http://kinase.com have been compared and cross-referenced with our initial list of kinases manually, taking into consideration the presence of functional motifs (see below) and the lengths of putative kinase domains.
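The method-specific E-value cutoffs quoted above can be applied as a simple filter before the cross-referencing step. The following Python sketch is illustrative only: the `(sequence_id, method, e_value)` tuple format, the function name and the use of "less than or equal to" at the threshold are assumptions, not a description of the scripts actually used in this study.

```python
# E-value cutoffs as quoted in the text, keyed by search method.
E_VALUE_CUTOFFS = {
    "PSI-BLAST": 0.0005,
    "IMPALA": 1e-8,
    "HMMer": 0.1,
}


def significant_hits(hits, cutoffs=E_VALUE_CUTOFFS):
    """Map each sequence ID to the set of methods whose cutoff it passes.

    `hits` is an iterable of (sequence_id, method, e_value) tuples; this
    input format is hypothetical. A hit passes when its E-value is at or
    below the cutoff for its method (an assumed convention).
    """
    passed = {}
    for sequence_id, method, e_value in hits:
        if e_value <= cutoffs[method]:
            passed.setdefault(sequence_id, set()).add(method)
    return passed
```

A sequence that appears under more than one method in the result is supported by several independent searches, which is the kind of agreement the cross-referencing step looks for; sequences supported by a single method would be natural candidates for the manual scrutiny described above.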
Domain assignment to the non-catalytic regions of the kinase-containing genes has been made using the HMM search methods by querying each of the kinase-containing sequences against the 3,071 protein family HMMs available in the Pfam database . Transmembrane segments have been detected using TMHMM .
Criteria for identification of sequences with protein kinase functional motifs
Among the various sequence motifs characteristic of protein kinases, the glycine-rich motif with the pattern GXGXXG and the catalytic base aspartate are suggested to be critical for the accommodation of ATP and for phosphorylation, respectively. These are essential properties of a protein kinase. After eliminating several sequences on the basis of the results of sensitive profile matching, cross-referencing with other databases and manual analysis, 556 sequences remained, but not all of them have the glycine-rich motif and the catalytic base. Although these 556 sequences showed statistically significant matches (E-values better than 0.0001 in PSI-BLAST) with the sequences of proteins shown experimentally to be protein kinases, and hence are likely to adopt the fold of a protein kinase, not all these sequences may function as protein kinases. We have used the glycine-rich motif and the presence of the catalytic base as essential criteria to identify likely functional kinases.
In protein kinases generally, apart from the glycine-rich motif, a positively charged residue (most often a lysine residue equivalent to Lys 72 of cyclic-AMP-dependent protein kinase) is also relevant to the accommodation of ATP. However, the number of residues that separate the glycine-rich motif and the charged residue along the primary structure is highly variable among the protein kinases. More importantly, certain subfamilies of protein kinases, such as MEK-related WNK [61,62], do not have such a positively charged residue supporting ATP but nevertheless function as kinases. Many known kinases are also not entirely consistent with the glycine-rich motif (see, for example, the PROSITE pattern for protein kinases). For example, protein kinases such as those in the casein kinase 2 subset lack one of the three glycyl residues in the consensus pattern. On the basis of these considerations, the following criteria have been arrived at and used for identifying the subset of sequences that possess the essential sequence motifs to function as protein kinases. First, at least two out of the three glycyl residues in the glycine-rich motif should be present. Second, an aspartate residue should be located in the sequence at a position equivalent to the catalytic base of known protein kinases.
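The two criteria can be expressed compactly once a sequence has been aligned to known kinases. The Python sketch below assumes that the alignment step has already extracted the six residues aligned to the GXGXXG pattern and the single residue aligned to the catalytic base; the function names and this input format are hypothetical, and the example windows in the usage note are illustrative strings rather than entries from the dataset.

```python
# 0-based positions of the three glycines in the GXGXXG consensus pattern.
GLYCINE_POSITIONS = (0, 2, 5)


def glycine_count(window):
    """Count glycines at the three consensus positions of a 6-residue window."""
    if len(window) != 6:
        raise ValueError("window must contain the 6 residues aligned to GXGXXG")
    return sum(1 for i in GLYCINE_POSITIONS if window[i] == "G")


def passes_motif_criteria(gly_window, catalytic_residue):
    """Apply both criteria from the text.

    1. At least two of the three consensus glycines are present.
    2. The residue aligned to the catalytic base is an aspartate (D).
    """
    return glycine_count(gly_window) >= 2 and catalytic_residue == "D"
```

For instance, a window such as `"GRGKYS"` retains two of the three glycines and would pass the first criterion, whereas `"ARGKYS"` retains only one and would not; either window fails overall if the aligned catalytic residue is anything other than `"D"`.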
Of the 556 sequences with statistically significant similarity to protein kinases, 53 were found to lack the aspartate catalytic base and hence were not considered for further analysis. Most of these sequences also lack the typical glycine-rich motif. Among the remaining 503 sequences, 55 lacked two or all of the three glycyl residues in the characteristic protein kinase glycine-rich region, although in these 55 sequences an aspartate residue aligns with the catalytic aspartate of known protein kinases. These sequences were also not included in the detailed analysis, leaving 448 sequences for further analysis. The number of protein kinase-like sequences identified in the current analysis is comparable to the numbers reported by others [11,64].
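The bookkeeping behind these numbers can be checked with a few lines of Python. The counts are those given in the text; the variable names are illustrative only.

```python
total_hits = 556          # sequences with significant similarity to protein kinases
no_catalytic_asp = 53     # lack the aspartate catalytic base
weak_glycine_motif = 55   # catalytic Asp present, but two or three glycines missing

with_catalytic_asp = total_hits - no_catalytic_asp
retained = with_catalytic_asp - weak_glycine_motif

print(with_catalytic_asp)  # 503
print(retained)            # 448
```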
While the two criteria above are used here to identify putative protein kinases, it should be noted that families of eukaryotic lipid kinases, bacterial antibiotic-phosphorylating enzymes and lipopolysaccharide-phosphorylating enzymes, which are evolutionarily related to eukaryotic protein kinases and share the fold, lack two or all of the three glycyl residues in the glycine-rich motif. Nevertheless, they can phosphorylate substrates, although the substrates are commonly not proteins or peptides. It will therefore be worthwhile to explore experimentally whether those sequences identified in the current analysis that have an aspartate at the catalytic base position, but lack two or all three glycyl residues, could function as kinases.
Classification of human protein kinases into subfamilies
A multiple sequence alignment of a collection of related proteins can be transformed into a position-specific scoring matrix (PSSM) or profile to use for further searches. The extent of sequence similarity and of conserved residues reflected in the PSSM will determine the effectiveness of matching of a query sequence. A PSSM generated from closely related family members forming a subfamily could be more effective than an overall family PSSM in specifically identifying further members of the subfamily. Thus a PSSM can be generated by considering the family of all the related sequences, or several PSSMs can be generated by considering subsets of the sequences to form subfamily profiles.
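As a rough illustration of how an alignment becomes a profile, the sketch below builds a simple log-odds PSSM from an ungapped alignment. It is a deliberately simplified stand-in: it assumes a uniform background frequency and a flat pseudocount, whereas PSI-BLAST uses sequence weighting and substitution-matrix-based pseudocounts. The function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column.

    Assumptions (not from the paper): uniform background frequencies and
    a flat pseudocount added to every residue in every column.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        # Count only standard residues; gaps and unknowns are ignored.
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, segment):
    """Sum the column scores of an ungapped segment as long as the PSSM."""
    if len(segment) != len(pssm):
        raise ValueError("segment length must equal the number of PSSM columns")
    return sum(col.get(aa, 0.0) for col, aa in zip(pssm, segment))


# Toy alignment of a glycine-rich region (hypothetical sequences).
toy_alignment = ["GTGSFG", "GSGAFG", "GEGTFG"]
toy_pssm = build_pssm(toy_alignment)
print(score(toy_pssm, "GTGSFG") > score(toy_pssm, "AAAAAA"))  # True
```

A profile built from a narrow subfamily has sharper column preferences than one built from the whole family, which is why a subfamily PSSM can discriminate its own members more specifically.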
We classified the 448 protein kinases according to the scheme of Hanks and Hunter. This classification is based on the kinase catalytic domain. The most closely related catalytic domains have been classified by Hanks and Hunter into subfamilies: AGC (including the protein kinase A, protein kinase G and protein kinase C families), CMGC (the cyclin-dependent kinase, MAP kinase, glycogen synthase kinase and casein kinase 2 families), CAMK (calcium/calmodulin-dependent kinase), RTK (receptor tyrosine kinases), and so on. A large number of eukaryotic protein kinases have been classified into Hanks and Hunter subfamilies, and the sets of classified kinases are available in the Protein Kinase Resource (PKR) [68,69]. We have used the multiple sequence alignment of protein kinases belonging to each of the subfamilies available in PKR to generate a profile of each subfamily that encodes the pattern of conservation and variation in amino-acid sequence characteristic of that group. These PSSMs form a library of profiles representing the signature sequence pattern of the different kinase subfamilies.
If a query sequence consisting of the catalytic domain of a putative protein kinase is searched against this database of profiles, it could be expected that the subfamily to which the query kinase belongs would appear as the best hit with the lowest E-value, with some other subfamilies appearing with less significance. If the E-value for the top hit is significant (E-value less than 0.0001) then the query sequence is said to cluster best with that subfamily and is considered to be a putative member of that subfamily. For example, if the catalytic domain of ENSP224103 is queried against the database of profiles, the AGC subfamily appears at the top of the list with an E-value of 10⁻¹²¹, followed by the CAMK subfamily (E-value 3 × 10⁻⁶⁸) and the polo subfamily (E-value 10⁻⁵⁸). This means that the sequence pattern of ENSP224103 matches most significantly with the AGC group of kinases and it is thus considered to be a member of this subfamily. Every human kinase sequence has been scanned against the subfamily profiles to associate it with the appropriate subfamily.
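The assignment rule can be sketched in Python as follows. The function name and the dictionary layout are assumptions made for illustration; the E-values are those quoted in the text for ENSP224103, and parsing of the actual IMPALA output is not shown.

```python
def assign_subfamily(evalues, threshold=1e-4):
    """Return the subfamily with the lowest E-value, or None if the
    best hit is not significant (E-value not below `threshold`)."""
    if not evalues:
        return None
    best = min(evalues, key=evalues.get)
    return best if evalues[best] < threshold else None


# E-values reported in the text for the catalytic domain of ENSP224103.
hits = {"AGC": 1e-121, "CAMK": 3e-68, "polo": 1e-58}
print(assign_subfamily(hits))  # AGC
```

A query whose best hit fails the threshold is left unassigned rather than forced into the nearest subfamily.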
The program IMPALA has been used to match the kinase query with the subfamily PSSMs. Although IMPALA was originally designed to identify distant relatives of a family, each PSSM here is constrained to the sequence patterns of a single subfamily, so the program helps identify the putative subfamily to which the query kinase catalytic domain belongs.
After associating the catalytic kinase domains with a Hanks and Hunter subfamily, the non-kinase domains in these proteins were identified using HMM and profile searching in the Pfam database, as described above. The SwissProt database was consulted extensively to derive functional information on the various kinase-associated domains. Protein kinases that have been implicated in various diseases have been analyzed using information from the SwissProt and literature databases.
Phylogenetic analysis of kinase and associated domains
The pairwise sequence distance between two protein domains has been calculated from the percent identities obtained from the multiple sequence alignment derived from MALIGN. The following expression was used to calculate a sequence-based dissimilarity measure between two proteins: D = -100.0 ln (pids/100), where pids is the percent identity between a given pair of protein domains. CLUSTAL W has been used to generate trees. The robustness of the nodes in the trees generated by the neighbor-joining method has been tested with the bootstrap algorithm available in CLUSTAL W, using 1,000 bootstrap replicates. Trees and bootstrap values for various branch orders of domains are described in the text and shown in Figures 3 and 4.
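The dissimilarity measure is straightforward to compute. The short Python sketch below implements the expression as given; the function name and the input check are illustrative additions.

```python
import math


def dissimilarity(percent_identity):
    """D = -100.0 * ln(pids / 100), with pids the percent identity
    between a pair of aligned protein domains."""
    if not 0.0 < percent_identity <= 100.0:
        raise ValueError("percent identity must be in the interval (0, 100]")
    return -100.0 * math.log(percent_identity / 100.0)


# Identical domains give D = 0; D grows as identity falls.
for pid in (100.0, 50.0, 25.0):
    print(pid, round(dissimilarity(pid), 1))
```

Because of the logarithm, the measure increases faster than linearly as identity decreases: a pair at 25% identity is twice as distant as a pair at 50% identity (about 138.6 versus 69.3).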
Note added in proof
While this paper was in review, a related paper was published by Hunter and coworkers that classifies and compares the human protein kinases with protein kinases encoded in other eukaryotic genomes.
Additional data files
A list of kinases identified in the human genome with their classification into groups, the multiple sequence alignment of catalytic regions with critical functional residues indicated, the domain structures of kinase-containing gene products, and a table with Ensembl ids and the corresponding gi ids are available as additional data files with the online version of this article. This information is also freely available from our KinG website.
We thank the anonymous referees for several important suggestions for the improvement of our work and paper. We thank S. Abhiman and K.R. Abhinandan for their help in setting up the KinG website and additional data files. A.K. is supported by a fellowship from the Council of Scientific and Industrial Research, India. This research is supported by the award of an International Senior Fellowship in biomedical sciences to N.S. by the Wellcome Trust, UK and by the computational genomics project supported by the Department of Biotechnology, India.
- Hanks SK, Quinn AM, Hunter T: The protein kinase family: conserved features and deduced phylogeny of the catalytic domains. Science. 1988, 241: 42-52.
- Pawson T: Introduction: protein kinases. FASEB J. 1994, 8: 1112-1113.
- Zhang Z, Yu X, Zhang Y, Geronimo B, Lovlie A, Fromm SH, Chen Y: Targeted misexpression of constitutively active BMP receptor-IB causes bifurcation, duplication, and posterior transformation of digit in mouse limb. Dev Biol. 2000, 220: 154-167. 10.1006/dbio.2000.9637.
- Frost DO: BDNF/trkB signaling in the developmental sculpting of visual connections. Prog Brain Res. 2001, 134: 35-49.
- Lee MH, Yang HY: Negative regulators of cyclin-dependent kinases and their roles in cancers. Cell Mol Life Sci. 2001, 58: 1907-1922.
- Irby RB, Mao W, Coppola D, Kang J, Loubeau JM, Trudeau W, Karl R, Fujita DJ, Jove R, Yeatman TJ: Activating SRC mutation in a subset of advanced human colon cancers. Nat Genet. 1999, 21: 187-190. 10.1038/5971.
- Zheng J, Knighton DR, ten Eyck LF, Karlsson R, Xuong N, Taylor SS, Sowadski JM: Crystal structure of the catalytic subunit of cAMP dependent protein kinase complexed with MgATP and peptide inhibitor. Biochemistry. 1993, 32: 2154-2161.
- Hunter T: A thousand and one protein kinases. Cell. 1987, 50: 823-829.
- Hanks SK, Quinn AM: Protein kinase catalytic domain sequence database: identification of conserved features of primary structure and classification of family members. Meth Enzymol. 1991, 200: 38-42.
- Johnson LN, Noble ME, Owen DJ: Active and inactive protein kinases: structural basis for regulation. Cell. 1996, 85: 149-158.
- International Human Genome Sequencing Consortium: Initial sequencing and analysis of the human genome. Nature. 2001, 409: 860-921. 10.1038/35057062.
- Supplementary information. [http://hodgkin.mbu.iisc.ernet.in/~king]
- Morrison DK, Murakami MS, Cleghon V: Protein kinases and phosphatases in the Drosophila genome. J Cell Biol. 2000, 150: F57-F62. 10.1083/jcb.150.2.F57.
- Plowman GD, Sudarsanam S, Bingham J, Whyte D, Hunter T: The protein kinases of Caenorhabditis elegans: A model for signal transduction in multicellular organisms. Proc Natl Acad Sci USA. 1999, 96: 13603-13610. 10.1073/pnas.96.24.13603.
- Hunter T, Plowman GD: The protein kinases of budding yeast: six score and more. Trends Biochem Sci. 1997, 22: 18-22. 10.1016/S0968-0004(96)10068-2.
- Leonard CJ, Aravind L, Koonin EV: Novel families of putative protein kinases in bacteria and archaea: evolution of the "eukaryotic" protein kinase superfamily. Genome Res. 1998, 8: 1038-1047.
- Robinson DR, Wu YM, Lin SF: The protein tyrosine kinase family of the human genome. Oncogene. 2000, 19: 5548-5557. 10.1038/sj.onc.1203957.
- Gough J, Karplus K, Hughey R, Chothia C: Assignment of homology to genome sequences using a library of Hidden Markov Models that represent all proteins of known structure. J Mol Biol. 2001, 313: 903-919. 10.1006/jmbi.2001.5080.
- Superfamily. [http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY]
- Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, Biswas M, Bucher P, Cerutti L, Corpet F, Croning MDR, et al: The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res. 2001, 29: 37-40. 10.1093/nar/29.1.37.
- kinase.com. [http://www.kinase.com]
- Drewes G, Ebneth A, Preuss U, Mandelkow EM, Mandelkow E: MARK, a novel family of protein kinases that phosphorylate microtubule-associated proteins and trigger microtubule disruption. Cell. 1997, 89: 297-308.
- Shimada T, Kawai T, Takeda K, Matsumoto M, Inoue J, Tatsumi Y, Kanamaru A, Akira S: IKK-i, a novel lipopolysaccharide-inducible kinase that is related to IkappaB kinases. Int Immunol. 1999, 11: 1357-1362. 10.1093/intimm/11.8.1357.
- Li Q, Estepa G, Memet S, Israel A, Verma IM: Complete lack of NF-κB activity in IKK1 and IKK2 double deficient mice: additional defect in neurulation. Genes Dev. 2000, 14: 1729-1733.
- Yang W, Cerione RA: Cloning and characterization of a novel Cdc42-associated tyrosine kinase, ACK-2, from bovine brain. J Biol Chem. 1997, 272: 24819-24824. 10.1074/jbc.272.40.24819.
- Groves MR, Barford D: Topological characteristics of helical repeat proteins. Curr Opin Struct Biol. 1999, 9: 383-389. 10.1016/S0959-440X(99)80052-9.
- Groves MR, Hanlon N, Turowski P, Hemmings BA, Barford D: The structure of the protein phosphatase 2A PR65/A subunit reveals the conformation of its 15 tandemly repeated HEAT motifs. Cell. 1999, 96: 99-110.
- Chen JJ, Crosby JS, London IM: Regulation of heme-regulated eIF-2 alpha kinase and its expression in erythroid cells. Biochimie. 1994, 76: 761-769. 10.1016/0300-9084(94)90080-9.
- Thomis DC, Doohan JP, Samuel CE: Mechanism of interferon action: cDNA structure, expression, and regulation of the interferon-induced, RNA-dependent P1/eIF-2 alpha protein kinase from human cells. Virology. 1992, 188: 33-46.
- Alam MR, Caldwell BD, Johnson RC, Darlington DN, Mains RE, Eipper BA: Novel proteins that interact with the COOH-terminal cytosolic routing determinants of an integral membrane peptide-processing enzyme. J Biol Chem. 1996, 271: 28636-28640. 10.1074/jbc.271.45.28636.
- Leung T, Chen XQ, Tan I, Manser E, Lim L: Myotonic dystrophy kinase related Cdc42-binding kinase acts as a Cdc42 effector in promoting cytoskeletal reorganization. Mol Cell Biol. 1998, 18: 130-140.
- Ponting CP: Novel domains in NADPH oxidase subunits, sorting nexins, and PtdIns 3-kinases: binding partners of SH3 domains? Protein Sci. 1996, 5: 2353-2357.
- Seipel K, Medley QG, Kedersha NL, Zhang XA, O'Brien SP, Serra-Pages C, Hemler ME, Streuli M: Trio amino-terminal guanine nucleotide exchange factor domain expression promotes actin cytoskeleton reorganization, cell migration and anchorage-independent cell growth. J Cell Sci. 1999, 112: 1825-1834.
- Brotherton DH, Dhanaraj V, Wick S, Brizuela L, Domaille PJ, Volyanik E, Xu X, Parisini E, Smith BO, Archer SJ, et al: Crystal structure of the complex of the cyclin D-dependent kinase Cdk6 bound to the cell-cycle inhibitor p19INK4d. Nature. 1998, 395: 244-250. 10.1038/26164.
- Andrade MA, Petosa C, O'Donoghue SI, Muller CW, Bork P: Comparison of ARM and HEAT protein repeats. J Mol Biol. 2001, 309: 1-18. 10.1006/jmbi.2001.4624.
- Walden PD, Cowan NJ: A novel 205-kilodalton testis-specific serine/threonine protein kinase associated with microtubules of the spermatid manchette. Mol Cell Biol. 1993, 12: 7625-7635.
- Lumeng C, Phelps S, Crawford GE, Walden PD, Barald K, Chamberlain JS: Interactions between beta 2-syntrophin and a family of microtubule-associated serine/threonine kinases. Nat Neurosci. 1999, 2: 611-617. 10.1038/10165.
- Miyagoe-Suzuki Y, Takeda SI: Association of neuronal nitric oxide synthase (nNOS) with alpha1-syntrophin at the sarcolemma. Microsc Res Tech. 2001, 55: 164-170. 10.1002/jemt.1167.
- Bashton M, Chothia C: The geometry of domain combination in proteins. J Mol Biol. 2002, 315: 927-939. 10.1006/jmbi.2001.5288.
- Sasakura Y, Ogasawara M, Makabe KW: Maternally localized RNA encoding a serine/threonine protein kinase in the ascidian, Halocynthia roretzi. Mech Dev. 1998, 76: 161-163. 10.1016/S0925-4773(98)00100-2.
- Oishi I, Sugiyama S, Otani H, Yamamura H, Nishida Y, Minami Y: A novel Drosophila nuclear protein serine/threonine kinase expressed in the germline during its establishment. Mech Dev. 1998, 71: 49-63. 10.1016/S0925-4773(97)00200-1.
- Gaozza E, Baker SJ, Vora RK, Reddy EP: A novel tyrosine kinase induced during growth arrest and apoptosis of myeloid cells. Oncogene. 1997, 15: 3127-3135. 10.1038/sj.onc.1201575.
- Kearns AE, Donohue MM, Sanya LB, Demay MB: Cloning and characterization of a novel protein kinase that impairs osteoblast differentiation in vitro. J Biol Chem. 2001, 276: 42213-42218. 10.1074/jbc.M106163200.
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22: 4673-4680.
- Filippa N, Sable CL, Hemmings BA, Obberghen EV: Effect of phosphoinositide-dependent kinase 1 on protein kinase B translocation and its subsequent activation. Mol Cell Biol. 2000, 20: 5712-5721. 10.1128/MCB.20.15.5712-5721.2000.
- Yue X, Favot P, Dunn TL, Cassady AI, Hume DA: Expression of mRNA encoding the macrophage colony-stimulating factor receptor (c-fms) is controlled by a constitutive promoter and tissue-specific transcription elongation. Mol Cell Biol. 1993, 13: 3191-3201.
- Nocka K, Buck J, Levi E, Besmer P: Candidate ligand for the c-kit transmembrane kinase receptor: KL, a fibroblast derived growth factor stimulates mast cells and erythroid progenitors. EMBO J. 1990, 9: 3287-3294.
- Xu Z, Williams BR: Genomic features of human PKR: alternative splicing and a polymorphic CGG repeat in the 5'-untranslated region. J Interferon Cytokine Res. 1998, 18: 609-616.
- Clemens MJ, Elia A: The dsRNA-dependent protein kinase PKR: structure and function. J Interferon Cytokine Res. 1997, 17: 503-524.
- Stanger BZ, Leder P, Lee TH, Kim E, Seed B: RIP: a novel protein containing a death domain that interacts with Fas/APO-1 (CD95) in yeast and causes cell death. Cell. 1995, 81: 513-523.
- National Center for Biotechnology Information: genomes. [ftp://ftp.ncbi.nlm.nih.gov/genomes]
- Ensembl: pep. [ftp://ftp.ensembl.org/pub/current_human/data/fasta/pep]
- The C. elegans Protein Database: Wormpep. [http://www.sanger.ac.uk/Projects/C_elegans/wormpep/]
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
- Schaffer AA, Wolf YI, Ponting CP, Koonin EV, Aravind L, Altschul SF: IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position specific score matrices. Bioinformatics. 1999, 15: 1000-1011. 10.1093/bioinformatics/15.12.1000.
- Eddy SR: Profile hidden Markov models. Bioinformatics. 1998, 14: 755-763. 10.1093/bioinformatics/14.9.755.
- Muller A, MacCallum RM, Sternberg MJ: Benchmarking PSI-BLAST in genome annotation. J Mol Biol. 1999, 293: 1257-1271. 10.1006/jmbi.1999.3233.
- Lindahl E, Elofsson A: Identification of related proteins on family, superfamily and fold level. J Mol Biol. 2000, 295: 613-625. 10.1006/jmbi.1999.3377.
- Bateman A, Birney E, Durbin R, Eddy SR, Howe KL, Sonnhammer EL: The Pfam protein families database. Nucleic Acids Res. 2000, 28: 263-266. 10.1093/nar/28.1.263.
- Krogh A, Larsson B, von Heijne G, Sonnhammer EL: Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001, 305: 567-580. 10.1006/jmbi.2000.4315.
- Xu B, English JM, Wilsbacher JL, Stippec S, Goldsmith EJ: WNK1, a novel mammalian serine/threonine protein kinase lacking the catalytic lysine in subdomain II. J Biol Chem. 2000, 275: 16795-16801. 10.1074/jbc.275.22.16795.
- Verissimo F, Jordan P: WNK kinases, a novel protein kinase subfamily in multi-cellular organisms. Oncogene. 2001, 20: 5562-5569. 10.1038/sj.onc.1204726.
- Falquet L, Pagni M, Bucher P, Hulo N, Sigrist CJ, Hofmann K, Bairoch A: The PROSITE database, its status in 2002. Nucleic Acids Res. 2002, 30: 235-238. 10.1093/nar/30.1.235.
- Kostich M, English J, Madison V, Gheyas F, Wang L, Qiu P, Greene J, Laz TM: Human members of the eukaryotic protein kinase family. Genome Biol. 2002, 3: research0043.1-0043.12. 10.1186/gb-2002-3-9-research0043.
- Walker EH, Perisic O, Ried C, Stephens L, Williams RL: Structural insights into phosphoinositide 3-kinase catalysis and signalling. Nature. 1999, 402: 313-320. 10.1038/46319.
- Hon WC, McKay GA, Thompson PR, Sweet RM, Yang DS, Wright GD, Berghuis AM: Structure of an enzyme required for aminoglycoside antibiotic resistance reveals homology to eukaryotic protein kinases. Cell. 1997, 89: 887-895.
- Krupa A, Srinivasan N: Lipopolysaccharide-phosphorylating enzymes encoded in the genomes of Gram-negative bacteria are related to the eukaryotic protein kinases. Protein Sci. 2002, 11: 1580-1584. 10.1110/ps.3560102.
- Smith CM, Shindyalov IN, Veretnik S, Gribskov M, Taylor SS, ten Eyck LS, Bourne PE: The protein kinase resource. Trends Biochem Sci. 1997, 22: 444-446. 10.1016/S0968-0004(97)01131-6.
- The Protein Kinase Resource. [http://pkr.sdsc.edu/html/index.shtml]
- Johnson MS, Overington JP, Blundell TL: Alignment and searching for common folds using a database of structural templates. J Mol Biol. 1993, 231: 735-752. 10.1006/jmbi.1993.1323.
- Manning G, Plowman G, Hunter T, Sudarsanam S: Evolution of protein kinase signaling from yeast to man. Trends Biochem Sci. 2002, 27: 514-520. 10.1016/S0968-0004(02)02179-5.