Sequence database mining and odorant receptor cloning
The general strategy for the search for full-length hOR genes is shown in Figure 1. It was based on the absence of introns in the coding sequences of mammalian ORs [1,11], as well as on the high overall sequence similarity and the presence of several highly conserved sequence motifs in all known mammalian ORs [2].
The first step was to identify all currently known hOR sequences by extensive keyword and homology-based searches of several public DNA and protein sequence databases (see Materials and methods). The resulting several hundred sequences were compared with each other by BLAST and multiple sequence alignments. DNA and protein entries were matched. All duplicates were cross-referenced and apparent pseudogenes having frameshifts, deletions and other defects obviously incompatible with receptor function were eliminated. This initial screen identified approximately 90 bona fide full-length (according to criteria described below) hOR genes and a large number of candidate sequence fragments.
The next step was identification of additional members of the family by exhaustive iterative homology searches of translated human genomic sequence databases, particularly the raw, unannotated high-throughput genomic sequences (HTGS), using protein queries corresponding to the known hORs. As the number of identified receptors grew, additional diverse sequences were added to the query list to capture additional ORs in databases. Genomic sequences containing areas of significant homology to ORs were subjected to open reading frame (ORF) searches. Several hundred identified ORFs of sufficient length (>250 amino acids) were translated and compared to known ORs. Three criteria were used to select putative ORs from this set: high overall sequence similarity, the presence of seven predicted transmembrane (TM) regions, and the presence of multiple positionally conserved sequence motifs, described in Materials and methods, that serve as signatures of this gene family. The basic criteria for recognizing a particular OR gene as encoding a full-length, functional receptor candidate were the presence of an uninterrupted ORF starting with an ATG codon and a complete seven-transmembrane unit. A few putative ORs were discarded from the final set because they contained structural features deemed to be incompatible with receptor function, such as insertions of multiple nucleotide sequence repeats in their ORFs.
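The ORF screen described above can be illustrated with a minimal sketch. It is not the software used in this study: the function names, the six-frame scanning strategy and the choice to start each ORF at the first in-frame ATG are assumptions made for illustration. Only the length cut-off (>250 amino acids) and the requirement for an uninterrupted ORF starting with ATG come from the text.

```python
# Illustrative sketch only; names and scanning strategy are hypothetical.
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=251):
    """Scan all six reading frames for ATG-initiated, uninterrupted ORFs.

    Returns tuples (strand, frame, start, end, n_codons). Coordinates are
    0-based on the scanned strand, end is exclusive and includes the stop
    codon, and n_codons counts the ATG but not the stop. The default of
    251 codons mirrors the '>250 amino acids' cut-off in the text.
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame ATG opens a candidate ORF
                elif start is not None and codon in STOPS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        hits.append((strand, frame, start, i + 3, n_codons))
                    start = None  # resume the search after the stop codon
    return hits
```

ORFs passing this filter would still have to be checked against the three selection criteria (overall similarity, seven predicted TM regions and conserved motifs), none of which the sketch attempts.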
These are intentionally minimalist criteria designed to exclude apparent OR pseudogenes, but not cryptic pseudogenes, such as those having defects at the level of transcription or splicing or containing subtle but functionally disruptive mutations in the receptor coding region. For example, human OR17-24 (hOR17.01.01 in our nomenclature, as discussed below), classified by others [23] as a probable pseudogene on the basis of deviations from other ORs in its carboxyl terminus and 5'-untranslated region (UTR) sequence, is considered a regular candidate OR according to our analysis. It should be noted that the final assignment of OR gene or pseudogene status must be based on functional expression data.
As a result of the search described above, 347 putative full-length OR genes have been identified in the human genome. This number includes all the previously known, annotated hOR sequences extracted from the public databases. It is conceivable that a small number of full-length hORs escaped detection because of frameshifts or other sequencing or assembly errors in the corresponding HTGS entries. We are continuing routine searches using updated versions of genomic sequences to identify such cases. Because of the very high occurrence of OR pseudogenes in humans [15,16] and the presence of ORs in highly variable parts of the human genome [13,24], it is also possible that some polymorphic members of this gene family exist in the human population in both intact and defective allelic forms. A small subset of identified OR ORFs was discarded because of defects such as partial deletions of the TM1 or TM7 regions, although the argument can be made that these OR genes could encode functional receptors. In addition, it has been hypothesized that odorant reception may also be mediated by receptors of a completely different class, such as guanylate cyclases [25]. These caveats notwithstanding, we estimate that we have identified at least 90-95% of all full-length prototypical ORs in the human genome.
Cloning coding regions of full-length human olfactory receptor genes
To validate sequences extracted by genomic database mining, all hOR gene coding sequences described in this paper were cloned by direct genomic PCR using mixed DNA from ten individuals as a template and oligonucleotide primers corresponding to the amino- and carboxy-terminal sequences of the corresponding ORFs. An average of eight independent clones for each receptor were isolated and sequenced in their entirety. While single-nucleotide polymorphisms (SNPs) were detected in many hORs (data not shown), the sequencing data essentially confirmed correct identification of full-length OR-encoding ORFs. Sequences shown in the multiple sequence alignment (Figure 2) are those extracted from genomic sequence database entries.
Studies of human OR gene and pseudogene sequences covering approximately 150 full-length receptor genes [17,18] were recently published. A larger set of annotated putative human OR genes identified by an automated search algorithm very recently became accessible through a World Wide Web interface [19]. A detailed comparison and cross-referencing of this overlapping set of data with the output of our independent database mining and genomic cloning effort allowed us to verify the hOR selection reported in this paper. The sets of hORs identified by us and by the other group strongly overlap, which corroborates the essentially complete overview of the hOR repertoire. We did, however, find differences between our data and those in the HORDE database. These include 29 hOR genes that are apparently identified as pseudogenes in the HORDE database but encode functional hOR candidates by our analysis, as well as 10 hORs not found in HORDE (see Materials and methods). Our extensive cloning and sequencing data supported the conclusions presented here. These differences are probably caused by DNA sequencing errors in raw high-throughput genomic sequence. Other differences include single nucleotide/amino acid changes as well as discrepancies in the definition of amino and carboxyl termini for some ORs. An important caveat in interpreting these discrepancies is an apparent evolutionary decline in the human olfactory apparatus. Many specific anosmias (inabilities to smell a particular odorant) have been identified in humans [26,27]. Although the molecular basis for these anosmias is not currently known, all or some of them could be caused by hereditary OR defects. The high proportion of hOR pseudogenes [15,16] and the unusually high rate of SNPs in hORs [28] point to significant variability in the composition of the functional OR repertoire in the human population. More research is needed to clarify these issues and to generate the final catalog of functional hORs.
Genomic localization and general features of hORs
It has been previously demonstrated that members of the hOR gene family are distributed on all but a few human chromosomes. Through fluorescence in situ hybridization analysis, Rouquier [15] showed that OR sequences reside at more than 25 locations in the human genome and that the human genome has accumulated a striking number of dysfunctional copies: 72% of these sequences were found to be pseudogenes. We identified a total of 347 putative functional hOR genes located in multiple clusters on all human chromosomes, except for 2, 4, 18, 20, 21, and Y, with the majority (155 hORs) on chromosome 11. Chromosome 11 is followed in frequency by chromosome 1 (42 ORs), 9 (26 ORs) and 6 (24 ORs). By contrast, chromosomes 10, 22 and X seem to carry only a single full-length OR gene. Full-length hOR genes are present in more than 50 distinct clusters and are interspersed with pseudogene OR sequences to a variable degree (data not shown).
The distribution of hORs from the two most abundant sources, chromosomes 11 and 1, in the phylogenetic tree of the OR repertoire is illustrated in Figure 3. While it does reveal some phylogenetic 'superclusters' of ORs, the chromosome-specific subsets of ORs are typically scattered across the phylogenetic tree. In addition to harboring almost half of the functional OR candidates, chromosome 11 contains an intriguing large cluster of 53 full-length OR genes (families 11.01 through 11.20 in Figure 2), which is evolutionarily distinct from the rest of the repertoire and is located near the telomeric end of chromosome 11. Despite the phylogenetic uniqueness, this subgroup has strongly conserved sequence signature motifs of ORs. The remainder of ORs from chromosome 11 are clustered in the centromeric region (q11-12) or at the other end of the chromosome (q24-25). Although the available information on genomic localization of the OR genes was extracted from the Human Genome Project data and used in developing OR nomenclature (see below), mapping the exact chromosomal position of each OR gene or studying their cluster organization was not the goal of this study. Instead, our focus was on identification and comparative analysis of putative full-length hOR sequences produced by conceptual translation of the corresponding genomic OR ORFs.
Whereas the amino-acid identity between a few of the most distant human ORs is as low as 20%, the average identity for a random pair of hORs is in the 35-40% range. This lower limit of sequence identity within the hOR family is similar to that reported for candidate Drosophila OR genes [29] and chemosensory receptors of Caenorhabditis elegans [30]. An average predicted hOR is approximately 315 amino acids long, whereas the shortest OR included in the list of full-length receptors according to our criteria is 291 amino acids (hOR06.12.03). Uncertainty in determining exact amino termini for many ORs (see below) makes it difficult to identify the longest receptor sequence; however, one hOR (hOR19.03.01) is at least 355 amino acids long.
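Identity figures such as those quoted above depend on how identity is counted. The sketch below shows one common convention, assumed here for illustration and not stated in the text: identical residues divided by the number of alignment columns in which both sequences carry a residue. The function name is hypothetical, and the input is assumed to be two rows taken from an existing alignment.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two rows of the same alignment.

    Assumption: columns where either row has a gap are ignored, so the
    denominator is the number of columns with a residue in both rows.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [
        (x, y)
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if x != gap and y != gap
    ]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)
```

Other denominators (for example, the length of the shorter sequence or the full alignment length) give somewhat different values, which matters when comparing thresholds between studies.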
Strictly speaking, a candidate hOR would be the most appropriate designation for a protein belonging to this structurally distinct human G-protein-coupled receptor family, as their odorant-detecting function has not yet been reported. As pointed out in a recent review [3], very little experimental data exists on the expression of OR genes in human olfactory sensory neurons [12,31]. Some studies indicate that OR genes can be expressed in tissues other than the olfactory epithelium, suggesting potential alternative biological roles for this class of chemosensory receptors. Expression of various ORs was reported in human and murine erythroid cells [32], developing rat heart [33], avian notochord [34] and lingual epithelium [35]. The best experimentally documented case is the existence of a large subset of mammalian ORs transcribed in testes and expressed on the surface of mature spermatozoa, suggesting a possible role for ORs in sperm chemotaxis [36,37,38,39]. It has also been hypothesized that olfactory receptors might provide molecular codes for cell-cell recognition in development and embryogenesis [40], including providing guidance for olfactory bulb glomeruli targeting by chemosensory neurons [9].
Conserved sequence motifs
Multiple sequence alignment (Figure 2) reveals single residues and more than a dozen amino-acid sequence motifs of various lengths conserved across the whole OR family, defining a sequence signature motif of an OR family member. Identification of each family member was confirmed by the signature motif conservation. Structural hallmarks of mammalian ORs based on comparison of smaller sets of receptors have been previously reviewed [2,3,5,41,42]. In addition to the traditional multiple sequence alignment (Figure 2), we used the 'sequence logo' presentation [43,44] to show the pattern of sequence conservation in all 347 full-length human ORs (Figure 4). In contrast to traditional consensus sequences and multiple sequence alignments, this approach allows a much more informative and easily interpretable visual representation of motifs and areas of significant sequence conservation in large protein families. Some conserved features are discussed in the following sections.
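In the usual sequence-logo formulation, the height of the letter stack at each alignment position is the information content of that column, that is, the maximum possible entropy (log2 20 for proteins) minus the observed Shannon entropy. The sketch below computes this quantity under simplifying assumptions that are ours and not taken from the text: gaps are ignored and no small-sample correction is applied. The function names are hypothetical.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content (bits) of one alignment column.

    Assumptions: gap characters are skipped and the small-sample
    correction used by some logo programs is omitted.
    """
    residues = [c for c in column.upper() if c != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    entropy = -sum(
        (k / n) * math.log2(k / n) for k in Counter(residues).values()
    )
    return math.log2(alphabet_size) - entropy


def logo_profile(aligned_seqs):
    """Per-column information content for equal-length aligned sequences."""
    return [column_information("".join(col)) for col in zip(*aligned_seqs)]
```

A fully conserved column scores log2 20 (about 4.32 bits), whereas a column in which all 20 amino acids are equally frequent scores zero, which is why invariant motif residues dominate a logo.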
Transmembrane domains and loops
Strong sequence conservation is apparent in the intracellular loops, probably reflecting interactions of ORs with common intracellular partners, such as Golf [45]. One such previously described conserved motif, located at the junction of transmembrane domain (TM) 3 and intracellular loop (IC) 2 and incorporating the E/DRY sequence conserved in all G-protein-coupled receptors, is MAYDRYVAIC (single-letter amino-acid notation). The highly conserved motif KAFSTCXSH (X is any amino acid) in IC3 contains one of the cysteines from the intracellular pair conserved in many G-protein-coupled receptors, as well as serines and threonines potentially involved in phosphorylation events as discussed below. Other highly conserved regions include TM1, TM2 and TM7; the last two domains are routinely used in the design of family-specific degenerate oligonucleotide primers for PCR amplification of ORs. Most of the sequence variability is observed in extracellular loops EC1 and EC3, membrane-spanning domains TM4, TM5 and to a lesser degree TM3 and TM6, as well as in the extreme amino and carboxyl termini of the receptors (not shown in Figure 4). Structural diversity of these extracellular loops and transmembrane domains is believed to reflect ligand-binding specificity of ORs [1,42]. One model of a ligand-binding pocket in ORs stipulates the existence of 20 variable amino-acid residues on transmembrane helices 3, 4 and 5 that constitute the putative ligand 'complementarity-determining region' [5,42]. Taking into account the high observed sequence variability (Figures 2 and 4) in additional OR domains known to be involved in ligand recognition in other G-protein-coupled receptors [46,47], such as TM6 and extracellular loops, that model may not be sufficient to provide a complete description of odorant binding specificity determinants.
Almost all G-protein-coupled receptors contain a highly conserved cysteine in EC1 and another in EC2. There is direct and circumstantial evidence that the residues form an intra- or intermolecular disulfide bond in many G-protein-coupled receptors, and in some cases these are critical for their ligand recognition and membrane trafficking [48,49,50]. OR sequences contain one highly conserved cysteine in EC1 and three in EC2, an unusual feature shared by only a few G-protein-coupled receptors, such as chemokine or P2Y1 receptors, which have an additional pair of disulfide-forming cysteines in the amino terminus and EC3 [50,51]. These four highly conserved cysteines suggest that two disulfide bonds may be present on the extracellular surface of ORs. Like other G-protein-coupled receptors, ORs also contain a conserved pair of cysteine residues in IC1 and IC2 (Figures 2 and 4).
Amino termini
Olfactory receptors have short (approximately 25-30 amino acids) extracellular amino termini that do not have homology to traditional cleavable leader peptides. A strongly conserved stretch of unknown functional significance, EF(I/L)LLG(L/F), is located approximately ten amino acids upstream of the predicted first transmembrane region. Like other G-protein-coupled receptors, ORs contain consensus sequences (NXS/T) for N-linked glycosylation near their amino termini and in EC1. Almost every OR has a single predicted amino-terminal N-glycosylation site and approximately 5% of the repertoire contains a consensus glycosylation site in EC1, whereas none are detected in EC2. A number of ORs also contain NXC motifs in the same locations. It was recently suggested that this motif, occurring in some murine ORs, is a possible novel N-linked glycosylation site [52]. It should be noted that exact prediction of amino-terminal sequences of hORs based on genomic sequence information is not always clear-cut. While the OR coding sequences are generally believed to be intronless, there is at least one report of an exception to this rule resulting in amino-terminal sequence heterogeneity [38]. In addition, 5'-UTR introns in OR genes are typically located very close to the OR coding region [53]. Extensive alternative splicing of 5'-UTRs of OR mRNAs has been described [38,54]. Therefore, both the straightforward conceptual translation of the longest OR ORF derived from genomic sequence and the use of intron/exon prediction programs may result in incorrect identification of OR amino termini. The presence of a potential splice acceptor site, homology to other ORs, and an initiating ATG sequence context conducive to in vivo translation [55] were taken into consideration to predict amino-terminal OR protein sequences in our analysis.
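A scan for the NXS/T and NXC motifs mentioned above can be sketched as follows. The function and pattern names are hypothetical, and the exclusion of proline at the X position is a commonly used refinement of the sequon that we assume here; the text itself specifies only NXS/T and NXC. The scan reports motif positions in the whole protein and does not restrict hits to the amino terminus or to a particular loop.

```python
import re

# Lookahead patterns so that overlapping motifs are all reported.
# Assumption: proline is excluded at the X position.
NXST = re.compile(r"(?=(N[^P][ST]))")
NXC = re.compile(r"(?=(N[^P]C))")


def find_sites(protein, pattern=NXST):
    """Return (1-based position, motif) for every match in a protein."""
    return [
        (m.start() + 1, m.group(1))
        for m in pattern.finditer(protein.upper())
    ]
```

For example, the toy sequence `MNGTAANPSNNSC` yields NXS/T hits at positions 2 and 10 and an NXC hit at position 11, while `NPS` at position 7 is skipped because of the proline. Assigning a hit to the amino terminus, EC1 or EC2 would additionally require the predicted transmembrane boundaries.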
Carboxyl termini
Predicted carboxyl termini of the ORs are short, with an average length of about 21 amino acids, and have significant similarity in their TM7-proximal half, starting with the highly conserved consensus RNK motif immediately adjacent to TM7. Many class I G-protein-coupled receptors, including rhodopsin and the β2-adrenergic receptor, contain one or two cysteine residues in their carboxyl termini about 12 amino acids from the end of TM7. These correspond to a putative palmitoylation site which allows formation of a 'fourth cytoplasmic loop' that may be involved in G-protein coupling. Only 26% of the human ORs have one or more cysteines in their carboxy-terminal regions, suggesting that carboxy-terminal palmitoylation, if it occurs, is not general for ORs. An additional feature of OR carboxyl termini is a high content of positively charged amino-acid residues occurring in a positionally conserved pattern, a not uncommon feature of many other G-protein-coupled receptors. Yet another typical feature of many of these receptors is the presence of multiple serine and threonine residues in their carboxyl termini and IC3 that serve as phosphorylation sites for G-protein-coupled receptor kinases (GRKs) as well as protein kinases C (PKC) and A (PKA), a mechanism involved in agonist-dependent desensitization of the receptors. Mammalian ORs are known to undergo rapid desensitization after odorant stimulation [56] and GRK3 has been implicated in this process [57,58]. Approximately 20% of the ORs do not have any serine or threonine residues in their carboxyl termini, whereas the remaining 80% have an average of more than two such residues, most of which are located in the vicinity of positively charged amino acids and conform to consensus sequences for phosphorylation by PKC or PKA.
In addition, the short third intracellular loop of all ORs contains four to five highly positionally conserved serine or threonine residues interspersed with four equally conserved basic amino acids, comprising a good potential site for phosphorylation.
A proposed nomenclature for functional human olfactory receptor candidates
By analogy with the classical examples of pharmacology-based nomenclatures of some well-studied G-protein-coupled receptor families, a nomenclature of functional olfactory receptors could be based on their odorant specificity. As such information is currently almost nonexistent for human ORs, other criteria should be used, such as the chromosomal localization of their genes, as in the recently proposed D. melanogaster OR nomenclature [59], or their structural phylogenetic analysis, as proposed for mammalian ORs [17,22]. We suggest an alternative hybrid nomenclature reflecting both phylogenetic clustering of OR gene products and their chromosomal localization. It is conceivable that phylogenetic closeness and high sequence similarity of ORs may reflect similarity in their ligand specificities. On the other hand, co-localization of a subset of OR genes in a particular genomic cluster might indicate their coordinate regulation and common biological function. One possible example of the latter is the OR gene cluster located on human chromosome 6 at the telomeric end of the HLA complex region. It has been shown recently that OR members from this cluster exhibit unusually high allelic variability, similar to the major histocompatibility complex (MHC) genes [24]. It is hypothesized that this group of human ORs could be functionally linked to MHC and involved in MHC-mediated mate preferences [60,61]. Another relevant example is the discovery of a block of human OR genes, at least one of which (OR7501 = hOR19.06.01) encodes a potentially functional receptor, in subtelomeric regions that are present in 7 to 11 similar copies on several chromosomes [13]. The presence of OR genes in these polymorphically multiplied, rearrangement-prone areas hints at a higher level of complexity and individual variation in the OR repertoire.
We generated phylogenetic trees using the neighbor-joining algorithm [62], as implemented in two standard phylogenetic analysis software packages, ClustalX [63,64] and Phylip [65]. In both cases, the dataset was bootstrapped [66] to provide a statistical estimate of the reliability of the resulting tree topology. The two resulting dendrograms, which had very similar topology, were used to identify OR clusters and assign the receptors to a particular family. A phylogenetic dendrogram derived from comparison of full-length OR protein sequences is shown in Figure 5, with the families bracketed. Strong phylogenetic clustering supported by the bootstrap tests (typically, only bootstrap values higher than 50% were considered significant) together with high sequence similarity (>40% identity) was used as the main criterion for grouping receptors into a particular family. The second criterion for defining a family was common chromosomal localization, including co-localization in a particular genomic locus if known at the time of analysis. It is believed that local tandem gene duplication is a common mechanism of OR evolution from common predecessors. Not surprisingly, there is a strong correlation between localization of an OR in a particular chromosomal cluster and its position in a phylogenetic dendrogram derived from comparison of full-length OR protein sequences (Figure 5). However, in a number of cases, ORs from different chromosomal loci converge in a phylogenetic cluster, implying recent gene duplication with interchromosomal insertion events from a distant region of the genome.
As a result of this branching analysis, the family of 347 hORs was subdivided into 119 families of 1 to 14 members each (Figure 5). The minimum pairwise intrafamily amino-acid identity is 43%, and the average minimum amino-acid identity within all of the 77 families having more than one member is 62%. Each chromosome-specific subset of hORs consists of 1 to 50 (for chromosome 11) independently numbered families, so that, for example, each chromosome has its own family 1. Members of each family are numbered sequentially from 1. An example of our nomenclature is hOR11.01.02, which represents a human olfactory receptor located on chromosome 11, from family 1 and which is member 2 of the family.
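The naming scheme can be made concrete with a small parser. This is a sketch under stated assumptions: the helper names are hypothetical, and because this section does not state how chromosome X is written in a receptor name, only names with a two-digit numeric chromosome field are accepted.

```python
import re
from typing import NamedTuple


class HorName(NamedTuple):
    chromosome: int
    family: int
    member: int


# Assumption: three zero-padded two-digit fields after the 'hOR' prefix.
NAME_RE = re.compile(r"^hOR(\d{2})\.(\d{2})\.(\d{2})$")


def parse_hor(name):
    """Split a name such as 'hOR11.01.02' into chromosome, family, member."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"not a recognized hOR name: {name!r}")
    return HorName(*(int(field) for field in m.groups()))


def format_hor(chromosome, family, member):
    """Build a name from its three numeric components."""
    return f"hOR{chromosome:02d}.{family:02d}.{member:02d}"
```

Because every field is zero-padded to two digits, sorting the names as plain strings gives the same order as sorting by `parse_hor`, which makes receptor lists easy to order by chromosome, family and member.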
Numerous ways of classifying ORs on the basis of cloning technique, clone name or genomic location have been put forward. The most advanced and consistent existing nomenclature, which includes both genes and pseudogenes and is implemented in the HORDE database, is based solely on phylogenetic clustering with consequent division of ORs into families and subfamilies [17,22]. According to this nomenclature, receptor sequences with 40% or more amino-acid identity are considered members of the same family, whereas those sharing 60% or more identity constitute a subfamily. We believe that, at least in some cases, OR names in our nomenclature carry more biologically relevant information and could be more rational. Consider, for example, receptors hOR22.01.01 (hOR11H1), hOR14.03.01 (hOR11H4), hOR14.03.02 (hOR11H6) and hOR14.03.03 (hOR11G2), where the HORDE names are in parentheses. According to our nomenclature (see Figure 5) these four receptors form two complete families, 22.01 with one member located on chromosome 22, and 14.03 with three chromosome 14 members located in a single gene cluster. According to the HORDE division, all four receptors belong to the family 11. Three ORs, including the one from chromosome 22 and two from chromosome 14, fall into subfamily H, whereas the remaining chromosome 14 OR belongs to subfamily G. In other cases, similar OR grouping is provided by both nomenclatures. Nevertheless, even for those cases, the names in our nomenclature carry information about the chromosome of origin for a given OR instead of an arbitrary (sub)family name. An added convenience of the nomenclature we are proposing is the straightforward, 'computer-friendly' format, which would facilitate the handling of hundreds of OR sequences. It allows encoding up to 99 OR subfamilies of 99 members for each chromosome, possibly including polymorphic OR forms, as well as easy sorting of receptor lists.
For the reasons discussed earlier, some additional functional hOR candidates are likely to be identified in the future, and will have to be accommodated by the nomenclature. It may also need to be refined as we learn more about the biological function and specificity of olfactory receptors and the detailed genomic mapping of OR clusters. We believe, however, that the OR nomenclature we describe represents a comprehensive, convenient and possibly more biologically relevant alternative to the existing OR classifications.