Ancient genomic architecture for mammalian olfactory receptor clusters

A new tool for genome-wide definition of genomic gene clusters conserved in multiple species was applied to olfactory receptors in five mammals, demonstrating that most mammalian olfactory receptor clusters have a common ancestry.


Background
Olfactory receptor (OR) genes constitute the largest superfamily in the vertebrate genome, with several hundred genes per species [1-3]. This large repertoire of receptors mediates the sense of smell through the recognition of diverse volatile molecules, used to detect food, predators, and mates. Mammalian OR genes reside in about 50 genomic clusters of one to several dozen genes, which are dispersed among many chromosomes [4,5]. Although the number of clusters is similar among species, the typical cluster size varies significantly because of extensive lineage-specific evolutionary events (for example, inter- and intra-chromosomal gene duplications and genomic deletions) [3,6-8].
Comparative analysis of mammalian OR clusters is crucial for deciphering the common evolutionary origins of the OR repertoires, as well as for highlighting inter-species differences.
Large-scale comparisons have mapped most pairwise relations among human and mouse clusters based on sequence similarity between individual genes [9]. A similar study also revealed that, in most cases, pairs of OR clusters that exhibit human-mouse similarity fall into established synteny blocks, which indicates their common origin [10]. Clusters that showed similarity but did not share a synteny relationship were attributed to inter-chromosomal duplication events. Similarly, the combination of synteny data and sequence similarity has been used to map between the majority of human and dog clusters, indicating their common origin [11]. Thirteen dog clusters that could not be mapped were suggested to be 'dog specific'.
A highly relevant endeavor is the recent establishment of a comprehensive network of whole-genome pairwise alignment chains, bridging between local sequence similarity and global synteny mapping, thus providing a better resolution for genome-wide comparisons [12]. Because this system currently includes all complete mammalian genomes published so far, including the marsupial opossum (Monodelphis domestica), it has the potential to assist greatly in conducting a comprehensive multi-species comparison of mammalian OR clusters. Here, we used this powerful framework to establish relationships among mammalian OR clusters on a genome-wide basis. This allowed us to reconstruct a parsimonious scenario for the evolution of gene clusters in the mammalian olfactory subgenome, and to reconstruct a putative OR cluster architecture of the common ancestor of five mammals, spanning nearly 200 million years of phylogeny.

OR genomic mining in opossum and dog
For the OR gene repertoire of the opossum Monodelphis domestica, we mined a total of 1,518 ORs (the nucleotide and protein sequences are available in Additional data files 9 and 10) from the Opossum October 2004 assembly (monDom1). This was achieved using computational methodologies described previously [3,13]. Because the opossum genome has not been assembled to the chromosome level, the sequence coordinates refer to genomic scaffolds. The assembly used consisted of scaffolds with an average length of about 4.5 megabases (Mb), ensuring inclusion of whole OR clusters or substantial parts thereof in most cases.
Our previously reported canine OR repertoire [14] was a result of combining directed DNA sequencing of the beagle genome and data mining of Celera's 1× poodle genome, and it contained 997 OR sequences without genomic location. For the purposes of the present study, we re-established the repertoire from the July 2004 assembly of the boxer breed (canFam1). We applied BLAT (BLAST [Basic Local Alignment Search Tool]-Like Alignment Tool) and other procedures as described previously [13], using the published canine ORs as queries. The new dataset obtained included 922 ORs (the nucleotide and protein sequences are available in Additional data files 11 and 12). The two repertoires were compared using Sequencher (version 4.2 for PC; GeneCodes Corp., Ann Arbor, Michigan, USA) with a 97% identity threshold to yield an overlap set of 765 ORs. The main reason why 189 of the poodle ORs failed to overlap the boxer genome is low sequence quality, mainly at the ends of the unmatched poodle ORs. The 209 ORs found in the new mining effort were classified into families and subfamilies and were assigned an appropriate symbol, using the nomenclature system of HORDE (Human Olfactory Receptor Data Exploratorium) [13]. The opossum and dog OR sequences are available in the HORDE database [15] and in Additional data files 9, 10, 11, and 12.

Identification of clusters in conservation
We aimed to produce a systematic depiction of the relationships among OR clusters of five mammalian species. For that we developed a three-step algorithm to identify CLICs (CLusters In Conservation), the multi-species equivalent of a genomic cluster. This algorithm progressed from the intraspecies identification of genomic clusters, through the pairwise comparison of individual ORs from different species, to integration in the multi-species framework of CLICs.
In the first step, we defined OR clusters in all five species, based on a selected maximal intergenic distance of 300 kilobases (kb). This resulted in the definition of 48 ± 5 (mean ± standard deviation) clusters with two or more ORs and 24 ± 9 singletons in the four placental mammals (Table 1). For opossum, the numbers were considerably greater, presumably because of the fragmented genome assembly in this species (Table 1).
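The distance-based cluster definition can be illustrated with a minimal sketch. It assumes that the intergenic distance is measured from the end of one gene to the start of the next on the same chromosome; the tuple layout, the function name `define_clusters`, and the toy gene names are hypothetical and do not come from the original pipeline.

```python
from collections import defaultdict

MAX_GAP_BP = 300_000  # maximal intergenic distance stated in the text (300 kb)


def define_clusters(genes, max_gap=MAX_GAP_BP):
    """Group genes into clusters, chromosome by chromosome.

    genes: iterable of (name, chrom, start, end) tuples (hypothetical layout).
    Returns a list of clusters; each cluster is a list of gene names in
    genomic order. Singletons are returned as one-gene clusters.
    """
    by_chrom = defaultdict(list)
    for name, chrom, start, end in genes:
        by_chrom[chrom].append((start, end, name))

    clusters = []
    for chrom in sorted(by_chrom):
        current, prev_end = [], None
        for start, end, name in sorted(by_chrom[chrom]):
            # Assumption: a gap larger than max_gap closes the current cluster.
            if prev_end is not None and start - prev_end > max_gap:
                clusters.append(current)
                current = []
            current.append(name)
            prev_end = end if prev_end is None else max(prev_end, end)
        if current:
            clusters.append(current)
    return clusters


# Toy coordinates, invented for illustration only.
toy_genes = [
    ("OR_a", "chr1", 1_000_000, 1_001_000),
    ("OR_b", "chr1", 1_200_000, 1_201_000),  # 199 kb after OR_a: same cluster
    ("OR_c", "chr1", 1_600_000, 1_601_000),  # 399 kb after OR_b: new cluster
    ("OR_d", "chr2", 500_000, 501_000),      # different chromosome: singleton
]
toy_clusters = define_clusters(toy_genes)
```

With these toy coordinates the sketch yields one two-gene cluster and two singletons.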
The second step was focused on relationships stemming from the UCSC (University of California at Santa Cruz) alignment net for 12 species pairs [12]. This net is a whole-genome pairwise alignment protocol that provides the best match to every position in the genome, according to both local sequence similarity and global genomic context. Of 5,969 ORs in five species, 5,305 (89%) were found to match an OR in an alignment net with at least one other species (Table 2). A small fraction (3.5%) of alignment pair events were between an OR and a genomic sequence not hitherto defined as an OR gene (see the legend to Table 2). The aligned ORs are shown in Figure 1 in a genomic position context, in which each panel shows a whole genome comparison of two species. The visible contiguous diagonal arrays of OR genes, often spanning considerable genomic segments, provide evidence for the conservation and syntenic organization of OR clusters in different mammals. Synteny often extends beyond the OR clusters, whereby the relevant alignment chain contained non-OR genes as well. For example, this was found to be true by manual examination for 30 out of all 33 human versus mouse chains.
The inter-species OR alignment pairs were filtered to highlight ORs with high confidence of orthology, defined here as 'syntenic orthologs', which correspond to well defined synteny blocks in addition to high mutual sequence identity. The final subset of syntenic orthologs contained OR pairs that belong to alignment chains longer than 100 kb and showing sequence identity higher than a 72% cutoff. Approximately 56% of all ORs (and 71% of the eutherian ORs) were included in the syntenic orthologs category.
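The two filtering criteria can be written as a short sketch. The record type `AlignedPair`, its field names, and the toy values are assumptions made for illustration; only the 100 kb chain length and 72% identity cutoffs are taken from the text.

```python
from typing import NamedTuple

MIN_CHAIN_BP = 100_000  # chain length cutoff from the text (100 kb)
MIN_IDENTITY = 72.0     # sequence identity cutoff from the text (percent)


class AlignedPair(NamedTuple):
    """One OR-to-OR alignment from a pairwise net (hypothetical record)."""
    ref_gene: str
    target_gene: str
    chain_length_bp: int
    identity_pct: float


def syntenic_orthologs(pairs, min_chain=MIN_CHAIN_BP, min_identity=MIN_IDENTITY):
    """Keep pairs that lie on a long chain and exceed the identity cutoff."""
    return [
        p for p in pairs
        if p.chain_length_bp > min_chain and p.identity_pct > min_identity
    ]


# Toy records, invented for illustration only.
toy_pairs = [
    AlignedPair("h1", "m1", 2_500_000, 85.0),  # passes both criteria
    AlignedPair("h2", "m2", 40_000, 90.0),     # chain too short
    AlignedPair("h3", "m3", 900_000, 65.0),    # identity too low
]
kept = syntenic_orthologs(toy_pairs)
```

Requiring both conditions together means that a highly similar pair on a short chain is excluded, which is how the filter separates conserved genomic position from sequence similarity alone.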
Finally, in the third step, CLICs were defined as connected components in an OR graph. A CLIC is thus a set that includes all OR clusters from different genomes, within which every cluster is connected by at least one syntenic orthology edge to at least one other cluster. Whenever several genes from the same species were aligned to a single gene in another species, and were defined as its syntenic orthologs, they were all included in the same CLIC.
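As an illustration of the connected-component step, the following sketch uses a union-find structure in which nodes are clusters and every syntenic ortholog pair links the clusters of its two genes. The function name `find_clics`, the input layout, and the cluster labels are hypothetical; the original implementation is not described at this level of detail.

```python
def find_clics(cluster_of, ortholog_pairs):
    """Return CLICs as sorted lists of cluster identifiers.

    cluster_of: dict mapping each gene to its cluster identifier.
    ortholog_pairs: iterable of (gene_a, gene_b) syntenic ortholog pairs.
    """
    parent = {c: c for c in set(cluster_of.values())}

    def find(x):
        # Follow parent links to the component root, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for gene_a, gene_b in ortholog_pairs:
        root_a, root_b = find(cluster_of[gene_a]), find(cluster_of[gene_b])
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    components = {}
    for cluster in parent:
        components.setdefault(find(cluster), set()).add(cluster)
    return sorted(sorted(members) for members in components.values())


# Toy data, invented for illustration only.
toy_cluster_of = {
    "h1": "human:A", "h2": "human:A",
    "m1": "mouse:A",
    "d1": "dog:A", "d2": "dog:B",
}
toy_edges = [("h1", "m1"), ("h2", "d1")]
toy_clics = find_clics(toy_cluster_of, toy_edges)
```

In this toy case, the human, mouse, and first dog cluster form one multi-species CLIC, whereas the second dog cluster, which has no syntenic orthology edge, remains a single-species CLIC.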
The foregoing analysis divided the examined mammalian OR repertoire into 251 mutually exclusive CLICs (Figure 2a,b, and Additional data file 1, with sample data in Table 3). Of these, 48 CLICs contained clusters from more than one species (multi-species CLICs), with most of them containing representations from all five mammals, or at least the four placental mammals. The multi-species CLICs encompassed 90% of the combined mammalian OR repertoire (Figure 2c). These results suggest a significant overall mammalian conservation of the cluster configurations, and lead to the inference that many of the OR clusters were present in the evolutionary common mammalian ancestor(s). As a caveat, we note that our analyses, based on large-scale genome alignments, are sensitive to cases of incompleteness of genome assembly.
Table 1 notes: a Formal release name as appears in UCSC genome browser [56]. b In parentheses: the number of genes used in this study after discarding genes that are mapped to 'chrUn' or 'random', and human genes from subfamily OR7E. OR, olfactory receptor; UCSC, University of California at Santa Cruz.
Table 2 notes: [56]. A pairwise comparison is directed from a reference genome to a target genome, and is thus not symmetric. b We filtered out alignments between an OR and a genomic segment that was mapped to 'chrUn' or 'random' (approximately 1% of all alignment pairs), was split between two separated genomic locations (approximately 7%), or did not overlap with any annotated OR from the collection described in Table 1 (approximately 3.5%). However, the overlooked segments may contain a genuine OR coding frame, and thus the counts are probably an underestimate for the ORs that have an orthologous counterpart. c The number of alignments that satisfy the criteria of syntenic orthology. The fraction out of the total number of reference genes is given in parentheses. d The total number of alignment chains that together contain all pairs of syntenic orthologs. Usually, each chain contains many such pairs and as such represents a unit of conservation. e Correlation coefficient between the two properties used for defining syntenic orthology: length of the alignment chain from which the aligned gene pair is derived, and the percentage mutual DNA identity between the genes of this pair. Genes with higher identity tend to be in longer chains. OR, olfactory receptor; UCSC, University of California at Santa Cruz.
A single species CLIC may represent a cluster that was not present in the inferred common ancestor, but was introduced more recently into a particular lineage. Although larger genomic clusters were usually assigned to multi-species CLICs, singleton ORs and small clusters often appeared as single species CLICs (Figure 2d).
The number of genes from each species in a given CLIC varied considerably (Figure 3). To obtain an overview of cluster sizes in the different species, we performed an analysis that focused on larger CLICs. This was done to filter noise stemming from small number statistics. Considering CLICs with at least 15 human genes (containing 80% of all genes in multi-species CLICs), human and dog had a similar gene number in a given CLIC, whereas mouse and rat had a larger number (typically 1.5-fold higher). Thus, the observed interspecies variation in repertoire size (Table 1) is explained not by the number of clusters but by differences in cluster size. This is in accordance with previous results [10,16].

Analysis of evolutionary events within CLICs
The definition of CLICs generates a common framework, within which species-specific evolution of OR clusters can be analyzed (Figure 3). A close examination of the CLICs reveals events such as cluster duplication, cluster deletion, and cluster splitting. The relevant evolutionary scenarios include unitary events (for instance, a genomic deletion in a single lineage) as well as complex events that occurred along more than one lineage. Nevertheless, absence of a CLIC from a genome may result from an assembly problem; this is particularly relevant to the opossum genome.
Cluster deletion is evident for CLIC #1, which contains one conserved OR cluster in all mammals except human (Figure 3b). A human-specific cluster deletion appears to be the best explanation, because otherwise there is a clear synteny relationship in this region for all five species examined (Figure 3b). We performed a BLAST search of the mouse OR protein sequences of this CLIC against the human repertoire, but the matches were of low sequence similarity (around 50% identity), supporting the absence of any human orthologs. This human-specific deletion of an OR cluster is intriguing because in mouse the relevant OR cluster on chromosome 4 was tentatively associated with the capacity to smell isovaleric acid [17,18], an odorant that many (but not all) humans can detect [19].
Inter-chromosomal cluster dispersion is observed for CLIC #31 (Figure 3c). It contains one OR cluster from every species except dog, whereas dog is represented by four clusters. Two of the dog clusters belong to two different human-dog synteny blocks, with the breakpoint located at the middle of the human OR cluster. For the two other clusters there is no conserved synteny beyond the stretch of OR genes. These inferred novel OR locations in the dog genome could have been created by an inter-chromosomal cluster duplication, or by movement of part of the cluster. In addition, four dog-specific CLICs (#113, #115, #116, and #123; see Additional data file 1) with a similar subfamily composition (belonging to the OR6 and/or OR9 families) might also have been created by a partial cluster duplication originating in CLIC #31. However, these CLICs belonged to short local alignments, and therefore were not integrated into CLIC #31. Family OR6 has greatly expanded in the rat lineage too, in this case within a single cluster assigned to CLIC #31 (Figure 3c).
Another example of cluster duplication is CLIC #32, which contains two clusters from each of the nonhuman species, whereas in human there are three clusters, two of which (chr14@19.5, chr15@19.8) are highly similar to each other (Figure 3d). This CLIC appears to capture a recent event of cluster duplication in the human lineage, as previously suggested, based on a similarity in the subfamily content [3]. Indeed, all members of the two human clusters showed at least 90% mutual protein identity, which is a very high score.
In parallel, the best mouse hits for most members of the two human clusters were found in a single mouse cluster (chr14@45.4). These results further support a cluster duplication in the human lineage.
In addition, genes from family OR4 are divided in a different way between the two clusters of each species, although they still belong to one CLIC (Figure 3d). This is consistent with the notion that the two clusters were originally on the same ancestral chromosome, as is indeed the case for human chromosomes 14 and 15 [20]. Chromosomal translocation was suggested to be a possible mechanism for fragmentation of a single genomic cluster into smaller clusters, whose ORs are from a common phylogenetic subfamily [21].
Figure 1. Conservation of synteny of OR genes. (a) All ORs from each species are ordered along the axis according to their genomic location from chromosome 1 to X (or by scaffold number in the case of the opossum), and by the internal megabase coordinates in each chromosome. Each point represents an alignment between two ORs from different species in the UCSC alignment net, colored according to the degree of DNA sequence identity (x-axis for the reference species, y-axis for the target species). Diagonals in both directions represent conservation of gene order, whereas reverse diagonals indicate a reverse of gene order relative to the 'plus' DNA strand. Off-diagonal points generally indicate micro-rearrangements, but those that are associated with low percentage identity possibly represent alignment errors. (b) Zoomed human versus mouse comparison, with chain numbers (by UCSC hg17 versus mm6 alignment net) indicated for the 16 alignment chains that contain at least six pairs of syntenic orthologs. Chains #95 and #183 represent disrupted synteny, because the alignment of a succession of ORs from human chromosome 6 is split between mouse chromosomes 13 and 17 (as described by Amadou and coworkers [26]). Chains #375 and #118 capture a genomic inversion. OR, olfactory receptor.

The reconstruction of the ancestral olfactory subgenome
For the purpose of reconstructing the probable ancestral olfactory mammalian subgenome, we considered all multi-species CLICs excluding six that appeared only in the two closely related rodents (Additional data file 1). These 42 CLICs were inferred to be present in the eutherian common ancestor genome. However, we cannot rule out the possibility that a single species CLIC existed in the ancestral genome but was lost in all but one species. Such a hypothesis may be especially valid for the dog-specific CLICs, for which only one event of cluster deletion in the human and rodent lineage is required, after the split from the dog. We therefore conducted a BLAST search with the 20 protein sequences of the 12 dog-specific CLICs against the human, mouse, and dog OR repertoires. Ten of these ORs are probably recent duplications in the dog OR repertoire, exhibiting high protein identity (>90%) to other dog ORs. The other ten genes were in general closer to their dog hit than to their human and mouse hits. Among the 42 multi-species CLICs, 26 were common also with opossum and were inferred to represent ancestral clusters in the last common ancestor of eutherians and marsupials. Less than one quarter of the opossum OR clusters (36 out of 163) were integrated into multi-species CLICs, as compared with 74% of all eutherian clusters (212 out of 288). In order to examine the likelihood of an ancestral origin of the remaining opossum clusters, we examined the opossum clusters disregarding the previously employed CLIC definition constraints.
Most of the opossum-specific CLICs (96 out of 127) were not found at all on the opossum-human or opossum-mouse alignment nets. These CLICs contained 232 ORs (out of a total 1,518 ORs in opossum), and ranged in size from 1 to 37 genes (Additional data file 1). At least 54 ORs of this group belonged to a unique expansion in the opossum genome, which exhibited low sequence similarity to eutherian genes (an average of 48% identity at the protein level). The other ORs belonged to OR subfamilies shared with eutherians, which were probably excluded from the alignment net because they were too divergent at the DNA level or because of assembly artifacts. Indeed, two-thirds of these scaffolds were less than 100 kb long. We found that 91% of the entire opossum genome is included in human-opossum alignment chains larger than 100 kb [22]. This is in good agreement with our finding that 1,340 out of 1,518 ORs (88.2%) are included in multi-species CLICs.
Each of the 31 remaining opossum-specific CLICs was merged with a predefined multi-species CLIC, which contained the gene with the highest sequence similarity in the human-mouse alignment net. No minimum sequence identity or chain length was required. As a result, the additional opossum clusters joined 20 multi-species CLICs; 13 of the target CLICs were devoid of an opossum cluster beforehand (Additional data file 1). Although this procedure may lead to the inclusion of false positives, the finding still provides evidence suggesting an early mammalian origin of 38 out of the 42 inferred ancestral clusters, and suggests that four CLICs (#14, #17, #39, and #42) are eutherian specific. However, the latter conclusion should be taken with caution, given the incomplete state of the opossum genome assembly.
For each of the 42 inferred ancestral clusters, an ancestral gene count was estimated, using a simple statistic derived from the cluster size distribution of the corresponding CLIC (Table 3). We note that assessing the number of genes in ancestral clusters is problematic, because contemporary clusters reflect an ongoing process of gene duplication and deletion, not necessarily at the same rate. With this caveat, it appears that the mammalian ancestor had approximately 1,070 OR genes. Of these, 38% were located in two large clusters of more than 100 genes (CLIC #23 and CLIC #26), 59% in medium size clusters of 7-44 genes, and the remaining 3% in small clusters of one to six genes. It is also possible, with appropriate caution, to reconstruct the internal organization of the ancestral clusters (Figure 4 and

Chicken-mammal conservation
The chicken OR repertoire was found to contain 554 genes, of which 476 (86%) were pseudogenized and only 78 had intact open reading frames [7,23]. The chicken OR repertoire was highly restricted, with 75% of the genes belonging to a single family (a newly defined family OR14; Olender T and coworkers, unpublished data). Only 8% of the chicken ORs were assigned a genomic location, even though 90% of the total chicken genomic sequence was contained within assembled chromosomes [7]. The failure of the majority of the chicken ORs to undergo whole-genome shotgun assembly probably stems from their high mutual sequence similarity.
The CLIC-defining algorithm was applied to the chicken OR gene repertoire. The cutoff of chain length was lowered to 50 kb, and no sequence similarity cutoff was used beyond the maximal expectation value embedded in the alignment chain definition. Only two chicken clusters (with a total of 13 OR genes) could be joined to the previously defined mammalian CLICs (Figure 3a and Additional data file 5). Most of the remaining chicken ORs, including those missing a genomic location, could not be aligned beyond the OR coding region.
Half of them were included in chains of 1,000-50,000 base pairs (bp) long, and hence they had the potential to contain an entire 1 kb OR coding region (Additional data file 6). This finding is perhaps unsurprising, given that most of the chicken ORs belong to chicken-specific expansion.
The largest chicken cluster, with 12 class I ORs (including four pseudogenes), belonged to CLIC #23 (Additional data file 5), and was included in an alignment chain that spanned 285 kb on chicken chromosome 1 and 2,500 kb on human chromosome 11 (with 103 human ORs). This chain also contained the syntenic β-globin cluster, with four chicken β-globins as compared with five human genes [24,25]. The second match between chicken and mammalian clusters was in CLIC #16, which contained a single OR from chicken chromosome 1 (belonging to subfamily OR10AC) aligned to human OR10AC1P on chromosome 7 (Additional data file 5). The human genomic region, related to the relevant alignment chain, contained six human OR genes (included in CLIC #16) and five bitter taste receptor genes. Of these, only one OR (OR10AC1P) and one taste receptor (TAS2R49) appeared in the human-chicken alignment net, indicating their conserved synteny. In addition, this chain included two conserved ephrin receptors (EPHB6 and EPHA1).

Discussion
The identification of orthology relationships among OR genes has been recognized previously as a complicated task [6,26,27]. OR orthologs have been defined for several pairs of genomes on the basis of amino acid sequence similarity [4,8,10,28]. However, signals of high sequence similarity among true orthologs are obscured in this large gene superfamily by extensive gene duplication as well as gene conversion and sequence divergence. A recent multi-species approach for ortholog identification increased the robustness of inference, by seeking three-way dog-human-mouse mutual best hits [14]. Naturally, such a strict requirement also reduced the sensitivity of detection. Alternative algorithms for large-scale orthology identification, such as COG [29], INPARANOID [30], and OrthoMCL [31], entailed complex many-to-many orthology relationships within a group of proteins but also relied solely on mutual coding sequence similarity. Enrichment by gene-related structural or functional data has proven effective in orthology determination [32,33], but it is impractical in the case of the OR genes because of the paucity of relevant information.
In the present study we took a novel approach that introduced the use of global synteny on top of local sequence similarity. Based on whole-genome pairwise alignments among five mammals, pairs of syntenic orthologs were identified with high confidence, supported by the conservation of genomic location. Applying the connected component algorithm to syntenic ortholog pairs from all species captured the intricate relationships within the OR gene superfamily, as manifested in the definition of CLICs. This resulted in groups of ORs presumably derived from a specific genomic location in a presumed evolutionary ancestor. We note that our conclusions are based on the assumption that very limited interaction/swapping of sequences has occurred among genes and clusters, for instance by gene conversion.
Another concept that we adopted to deal with the complexity of the OR gene superfamily is the definition of an evolutionary common ancestor at the cluster level rather than at the gene level. Common ancestry of similar clusters has previously been inferred only with regard to pairs of species (human versus mouse [9,10] or human versus dog [11]) or to specific clusters [34,35]. It has also been observed that the number of clusters is surprisingly similar among mammals, despite considerable variation in the total repertoire size [4]. An important advance presented here is the definition of multi-species sets of conserved clusters, providing one-to-one mapping among clusters of different species. These newly defined CLICs revealed evidence of an ancestral evolutionary origin of the mammalian OR clusters, rather than independent cluster formation in each lineage. This suggests that the uniform number of mammalian clusters stems from an ancestral common architecture that remained practically unchanged in contemporary species.
The CLIC framework was found to apply also to the OR repertoire of the more ancient opossum. Hence, the formation of the OR cluster architecture appears to have taken place before the split between marsupials and eutherians 185 million years ago. Importantly, the analysis at the cluster level revealed a conservation signal that could hardly be detected at the individual gene level, because of the relatively high (approximately 40%) DNA sequence divergence in human-opossum pairs of OR coding regions (Additional data file 7). However, in contrast to other species, ORs in the opossum formed numerous additional clusters that could not be assigned to the shared set of CLICs. This phenomenon could represent lineage-specific expansion of the marsupial repertoire or, alternatively, loss of ancestral clusters from the eutherian lineage. Finding out which of these alternative scenarios is correct could be aided by an outgroup genome such as that of the monotreme platypus Ornithorhynchus anatinus [36]. We note that current fragmentation of the opossum genome assembly could be an alternative reason for hampering proper CLIC joining of opossum ORs.
The question of a potential origin of OR clusters beyond the mammalian lineage has been addressed here by broadening the comparative analysis to the chicken OR repertoire. Accordingly, only one nonsingleton cluster, which includes class I receptors, has an evident common origin with a corresponding mammalian cluster. This cluster was previously suggested to be the most ancient olfactory cluster [3]. The inability to identify CLIC relationships for other clusters in the chicken genome could be due either to considerable repertoire divergence after the mammalian-avian split or to massive OR gene loss in the avian lineage. The latter is supported by a relatively poor diversity and massive pseudogenization of the chicken OR repertoire [7,23]. We have also begun to analyze the OR repertoire of the frog Xenopus tropicalis [7], whose genome assembly is currently too fragmented to allow CLIC analysis. However, we were able to discern considerable diversity, with practically all human-defined OR gene families amply represented (unpublished data). This result, which is in agreement with previously published work [7], may indicate that a rich OR repertoire existed before the amphibian-reptilian split, providing further support to the chicken OR loss scenario.
The CLIC analysis provides a framework for a further level of analysis beyond evolutionary conservation, namely the study of variability among repertoires. The ongoing process of 'birth and death' of genes leads to large fluctuations in the number of functional receptors [37]. As the diversity of the OR repertoire may serve as an indication for functional olfactory acuity of an organism [4,38,39], comparing variability at the cluster level (for instance, rearrangements within clusters and loss or gain of complete clusters) would help to discern potential functional differences among species. An example reported here is the loss of a complete cluster from the human lineage. A presumed syntenic mouse genomic cluster belonging to CLIC #1 was associated with smelling isovaleric acid [17,18]. However, because humans are still capable of detecting this odorant, it is possible that OR(s) from another cluster compensates for this loss.
The increase of repertoire size can occur via two main processes: expansion within clusters, or dispersion to new genomic locations. The former appears to dominate the increase of the rodent repertoire, as illustrated by a consistent excess of rodent genes in mammalian CLICs. Extensive tandem gene duplication in rodents was pointed out previously as a dominant factor in OR evolution [8,10,16]. The present study further relates this process to the variation between mouse and rat repertoire sizes, which appears to have arisen mainly from a dramatic expansion of a single rat cluster (CLIC #31). This may represent an enhanced recognition or discrimination of the rat toward a specific set of odorants, potentially related to a species-specific ecologic/behavioral niche.
Cases of lineage-specific clusters have previously been described for the human repertoire [40,41]. A similar phenomenon has been demonstrated here by several dog-specific CLICs that represent an expansion of subfamily OR6C to eight distant locations in the dog genome. Interestingly, the same subfamily has been amplified independently via an inter-chromosomal process in the dog genome, and via an intra-chromosomal duplication within a single rat cluster.
We considered whether our analysis identifies evidence for a single OR that seeded the evolution of a cluster. Such a scenario might appear as a CLIC composed of a single gene in one lineage and more in others. We identified one case, namely CLIC #3, which matches the suggested scenario, with one OR in the mouse and two to four ORs in the other species. However, this situation is indistinguishable from a species-specific deletion.
An important finding of the present analysis is that OR clusters represent an ancient genomic architecture of the mammalian genome. This conserved feature implies biologic importance, potentially related to a common regulatory mechanism of gene expression control [42-45]. Further support for this notion derives from the observation that the primate-specific OR7E subfamily, composed chiefly of nonfunctional pseudogenes, shows a much sparser cluster architecture, with a considerable number of singletons. One mechanism of cluster generation and propagation is related to genomic sequence repeats [46]. It is noteworthy that shared clustering appears despite the diversity of repeat elements in different mammalian genomes [47,48].
The correct description of evolutionary relationships among mammalian OR clusters is important for an additional reason; it could provide a useful avenue to the identification of regulatory elements. The framework of CLICs provides a natural set of orthologous sequences for the identification of ANCORs (ancestral noncoding conserved regions [49]) within an individual OR cluster. Such elements are appropriate candidates for a regulatory role, such as transcription regulation or post-transcriptional modification. A great challenge in the study of ORs is to elucidate the regulatory mechanisms that mediate exclusive expression of a single allele of one receptor per olfactory neuron. Exploring ANCORs within CLICs may suggest putative key players in this process.

Conclusion
The genomic architecture of mammalian OR gene clusters has an ancient evolutionary origin, preceding the marsupial-eutherian split. Species-specific evolution has further shaped the different olfactory subgenomes, both via gain and loss of complete clusters, and via expansion and contraction of existing clusters. The framework of CLICs enables one to pinpoint genomic commonalities and differences among species, and potentially relate them to olfactory capabilities. The same approach may also be applicable for other gene superfamilies.

Figure 3. CLICs of OR genes. (a) CLICs (columns) are shown by human genomic order (see Table 3).

OR genes and clusters

Human
The complete human OR repertoire of 851 genes and pseudogenes, including genomic coordinates mapped onto the May 2004 (hg17) assembly, was extracted from the HORDE database [13]. Subfamily OR7E (86 genes), which represents a primate-specific expansion [41], was excluded from the analysis.

Mouse and rat
A total of 1,296 mouse ORs were kindly provided by Zhang and Firestein [18] (accession numbers AY072961-AY074256). A total of 1,758 rat ORs [16] were kindly provided by J Young and B Trask, University of Washington (accession numbers are detailed in Additional data file 8). We assigned their genomic locations on the mouse March 2005 assembly (mm6) and rat June 2003 assembly (rn3) using BLAT [50].

Chicken
A total of 554 chicken ORs were used as described [7], with their published coordinates on the chicken February 2004 assembly (galGal2). For the purposes of this study, genes that were mapped to an undefined chromosomal location (the virtual chromosomes 'chrUn' and chromosomes with the suffix 'random') were filtered out (Table 1).
For each of the six repertoires (Table 1), sets of OR genes located on the same chromosome with no more than 300 kb distance between consecutive genes were identified as OR clusters, including singleton clusters with a single gene. Because the number of identified clusters decreases as a function of the maximal intergenic distance allowed, we selected a distance criterion of 300 kb, at which the rate of this decrease becomes more moderate (Additional data file 3). The genomic coordinates of all analyzed ORs and their assignment to clusters are available in the HORDE database [15] and in Additional data file 8.
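The cluster definition above can be illustrated with a minimal sketch. It assumes a hypothetical input layout of (name, chromosome, start, end) tuples and measures the intergenic distance from the end of one gene to the start of the next; the text does not state how the distance was measured, so that choice is an assumption.

```python
from collections import defaultdict

MAX_GAP_BP = 300_000  # maximal intergenic distance given in the text (300 kb)

def define_clusters(genes, max_gap=MAX_GAP_BP):
    """Group OR genes into clusters, singletons included.

    genes: iterable of (name, chrom, start, end) tuples (hypothetical layout).
    Returns a list of clusters, each a list of gene names in genomic order.
    """
    by_chrom = defaultdict(list)
    for name, chrom, start, end in genes:
        by_chrom[chrom].append((start, end, name))

    clusters = []
    for chrom in sorted(by_chrom):
        current, prev_end = [], None
        for start, end, name in sorted(by_chrom[chrom]):
            # Assumption: distance is measured from the previous gene's end
            # to this gene's start; a larger gap opens a new cluster.
            if prev_end is not None and start - prev_end > max_gap:
                clusters.append(current)
                current = []
            current.append(name)
            prev_end = end if prev_end is None else max(prev_end, end)
        clusters.append(current)
    return clusters

# Made-up coordinates for illustration only.
example = [
    ("OR_a", "chr1", 1_000, 2_000),
    ("OR_b", "chr1", 250_000, 251_000),
    ("OR_c", "chr1", 700_000, 701_000),
    ("OR_d", "chr2", 5_000, 6_000),
]
print(define_clusters(example))
```

Rerunning the function over a range of `max_gap` values and counting the clusters would give a curve of the kind described for Additional data file 3.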

Data mining procedures of opossum ORs
The first 500 ORs were identified based on the UCSC human-opossum alignment net (assembly hg17 versus monDom1). A BLAT search [50] was performed using all 499 opossum ORs that were found to represent true OR sequences after translation into proteins. All hit locations were then extracted from the genome and subjected to further protein translation procedures.
The first TBLASTN search [51] was performed using the following 24 OR sequences: cOR4Z2, cOR9S6, MOR177-3, MOR220-1, MOR248-10, MOR263-4, MOR264-6, and an additional 17 consensus sequences representing the 17 human OR families. The second, third, and fourth rounds used 30, 17, and 65 opossum ORs, respectively, as queries. The criterion for choosing a particular OR to serve as a query was that it represented an OR subfamily not included in the previous rounds of data-mining queries. The same criterion was used to select 90 frog ORs and 12 chicken OR sequences for a fifth round of TBLASTN search. Because the last two rounds did not discover additional ORs, the search was discontinued.
TBLASTN searches were conducted with the parameter -b set to 1,000.
All hits longer than 30 amino acids and showing at least 30% identity were extracted from the genome and expanded to contain 2,000 bp. These were then translated into proteins and aligned, using CLUSTAL [52], to a multiple alignment of human and mouse ORs [53]. Those found to contain the seven transmembrane domains, and one-third of the typical amino-terminus and carboxyl-terminus lengths, were automatically considered intact ORs; otherwise they were translated via FASTY. The typical OR amino- and carboxyl-terminus lengths are based on the human and mouse OR repertoires.
To classify an opossum sequence as an OR we required a sequence identity of at least 40% over at least 100 amino acids to any tetrapod OR. Sequences that shared more than 30% sequence identity but less than 40% were searched by BLAST against GPCRDB [54] and all known OR sequences. The score of the best hit from each search was collected. The decision (OR or non-OR) was based on the highest score. Classification into OR families and subfamilies was performed as explained elsewhere [3,13].
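The decision rule in this paragraph can be written as a small function. This is only a sketch: the function name, argument names, and the "unresolved" label are hypothetical, and the handling of cases the text does not describe (identity of 30% or less, high identity over fewer than 100 amino acids, tied scores) is an assumption.

```python
def classify_candidate(identity_pct, aligned_len_aa,
                       best_or_score=None, best_gpcr_score=None):
    """Classify a candidate opossum sequence as "OR" or "non-OR".

    identity_pct:    best percent identity to any tetrapod OR
    aligned_len_aa:  length (amino acids) over which that identity holds
    best_or_score:   best BLAST score against known OR sequences
    best_gpcr_score: best BLAST score against GPCRDB
    """
    # Direct acceptance: at least 40% identity over at least 100 residues.
    if identity_pct >= 40.0 and aligned_len_aa >= 100:
        return "OR"
    # Borderline zone (more than 30%, less than 40%): the higher of the two
    # best BLAST scores decides.
    if 30.0 < identity_pct < 40.0:
        if best_or_score is None or best_gpcr_score is None:
            return "unresolved"  # assumption: scores are needed to decide
        # Assumption: a tie is not counted as an OR.
        return "OR" if best_or_score > best_gpcr_score else "non-OR"
    # Assumption: the text does not cover the remaining cases.
    return "unresolved"

print(classify_candidate(45.0, 120))
print(classify_candidate(35.0, 150, best_or_score=210.0, best_gpcr_score=180.0))
```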

OR genes in alignment chains
An automatic tool (GENETALIGN) was designed to mine and present gene-related information from the UCSC alignment nets [50]. It accepts a list of gene names and coordinates in two aligned species and generates pairs of aligned genes. The 12 alignment nets, which correspond to all available pairwise comparisons among the five mammalian genomes analyzed here (Table 2), were downloaded from the UCSC web page [55]. Pairwise sequence alignment files in AXT format [56] were scanned for alignment blocks whose genomic coordinates overlap with any annotated OR coding sequence in the reference species, and the exact corresponding segments were extracted. Segments shorter than 100 bp in the reference species were filtered out, as were alignments to segments longer than 1,500 bp in the compared species, which are much longer than the typical OR coding sequence. Two alignments separated by no more than 500 bp in both species were joined by adding the required gaps. GENETALIGN thus produces a list that specifies, for each gene in the reference species, the genomic segments to which the gene is aligned, together with the alignment length and percentage DNA sequence identity. The coordinates of the aligned sequences in the target species were compared with the OR coding sequence annotation of that species. Alignments were discarded if the target segments were not annotated as ORs, were split between two separate genomic locations, or were located in undefined chromosomal locations.
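The length filters and the joining step described above can be sketched as follows. This is not the GENETALIGN code: the `Block` type and both function names are hypothetical, the blocks are assumed to belong to one gene on one chromosome pair in the same orientation, and applying the filters before the join simply follows the order of the text.

```python
from dataclasses import dataclass

@dataclass
class Block:
    """One aligned segment (hypothetical layout, half-open coordinates)."""
    ref_start: int
    ref_end: int
    tgt_start: int
    tgt_end: int

def filter_blocks(blocks, min_ref_bp=100, max_tgt_bp=1500):
    """Drop blocks under 100 bp in the reference or over 1,500 bp in the target."""
    return [
        b for b in blocks
        if (b.ref_end - b.ref_start) >= min_ref_bp
        and (b.tgt_end - b.tgt_start) <= max_tgt_bp
    ]

def join_blocks(blocks, max_gap_bp=500):
    """Join consecutive blocks separated by at most 500 bp in both species."""
    merged = []
    for b in sorted(blocks, key=lambda blk: blk.ref_start):
        if merged:
            last = merged[-1]
            ref_gap = b.ref_start - last.ref_end
            tgt_gap = b.tgt_start - last.tgt_end
            # Assumption: only forward, non-overlapping gaps are joined.
            if 0 <= ref_gap <= max_gap_bp and 0 <= tgt_gap <= max_gap_bp:
                merged[-1] = Block(last.ref_start, b.ref_end,
                                   last.tgt_start, b.tgt_end)
                continue
        merged.append(b)
    return merged

# Made-up coordinates for illustration only.
blocks = [
    Block(1000, 1400, 5000, 5400),
    Block(1450, 1900, 5460, 5910),
    Block(3000, 3050, 9000, 9050),     # too short in the reference
    Block(6000, 6500, 20000, 22000),   # too long in the target
]
print(join_blocks(filter_blocks(blocks)))
```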
To associate genes with alignment chains, alignment annotation files in NET format [56] were scanned for chains that overlap with coordinates of ORs. Only those with an overlap of at least 100 bp were selected. Then, genes associated with a single chain were selected, and the chain number, length, and type were added to the alignment description from the previous step. This procedure was also performed for the chicken versus human alignment net [55].
All analysis procedures were performed on whole-genome alignments.

Definition of syntenic orthologs
A pair of syntenic orthologs is defined as two ORs from different species that lie within the same alignment chain of at least 100 kb and share at least 72% DNA sequence identity over the OR coding region. The criterion of 100 kb for minimum chain length was selected to provide a global conservation of genomic neighborhood and usually represents previously defined synteny blocks [50]. The identity value corresponds to half of a standard deviation below the mean sequence identity of all eutherian aligned pairs (78%). Such a subset was defined for every pair of genomes that was analyzed. For the chicken-human comparison, the cutoff of chain length was lowered to 50 kb, and no sequence similarity cutoff was used beyond the maximal expectation value embedded in the alignment chain definition.
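The two criteria can be expressed as a simple predicate, shown here as a sketch with hypothetical function and argument names. The standard deviation of 12 percentage points used in the example is not stated in the text; it is back-calculated from the reported mean (78%) and cutoff (72%).

```python
MIN_CHAIN_BP = 100_000   # minimum chain length for mammalian comparisons
MIN_IDENTITY_PCT = 72.0  # minimum DNA identity over the OR coding region

def identity_cutoff(mean_pct, sd_pct):
    """Cutoff placed half a standard deviation below the mean identity."""
    return mean_pct - 0.5 * sd_pct

def is_syntenic_ortholog(chain_id_a, chain_id_b, chain_len_bp, identity_pct,
                         min_chain_bp=MIN_CHAIN_BP,
                         min_identity_pct=MIN_IDENTITY_PCT):
    """True if two ORs from different species qualify as syntenic orthologs.

    chain_id_a / chain_id_b: identifier of the alignment chain holding each
    gene (None if the gene is not associated with a single chain).
    """
    same_chain = chain_id_a is not None and chain_id_a == chain_id_b
    return (same_chain
            and chain_len_bp >= min_chain_bp
            and identity_pct >= min_identity_pct)

# Back-calculated illustration: 78 - 0.5 * 12 gives the 72% cutoff.
print(identity_cutoff(78.0, 12.0))

# Chicken-human setting from the text: 50 kb chains, no identity cutoff.
print(is_syntenic_ortholog(7, 7, 60_000, 55.0,
                           min_chain_bp=50_000, min_identity_pct=0.0))
```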

CLIC generation
CLICs are defined over a graph of OR genes (nodes), connected by two types of edges. One type connects pairs of syntenic orthologs, the other type represents immediate neighborhood relations within an OR gene cluster. A CLIC is a connected component of this graph (all the genes connected to each other either directly or via other nodes in the group). Therefore, all genes from one genomic cluster belong to the same component, and all of their orthologs, together with their complete clusters, are also included in this component.
An algorithm to divide a graph into its connected components was constructed using the clustering functions in the MATLAB statistics toolbox. It was then applied to the set of 5,969 OR genes from all species. The sequence identity parameter for defining syntenic orthology, as well as the intergenic distance parameter for defining genomic clusters, were chosen to minimize the inclusion of clusters from different chromosomes of the same species in one CLIC.
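The graph definition of a CLIC can be illustrated with a short connected-components sketch. The original analysis used MATLAB clustering functions; the version below swaps in a union-find structure in Python, and the function name and input layout (clusters as ordered gene lists, orthologs as gene-name pairs) are assumptions made for illustration.

```python
def build_clics(clusters, ortholog_pairs):
    """Return CLICs as sorted lists of gene names.

    clusters:       list of genomic clusters, each a list of gene names in
                    genomic order (neighbors are linked by an edge)
    ortholog_pairs: iterable of (gene_a, gene_b) syntenic ortholog pairs
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Edge type 1: immediate neighbors within a genomic cluster.
    for genes in clusters:
        for g in genes:
            find(g)  # register singletons too
        for a, b in zip(genes, genes[1:]):
            union(a, b)

    # Edge type 2: syntenic orthologs between species.
    for a, b in ortholog_pairs:
        union(a, b)

    components = {}
    for g in list(parent):
        components.setdefault(find(g), set()).add(g)
    return sorted(sorted(members) for members in components.values())

# Made-up gene names: h = human, m = mouse, d = dog.
clusters = [["h1", "h2"], ["h3"], ["m1", "m2"], ["m3"], ["d1"]]
orthologs = [("h2", "m1"), ("h3", "m3")]
print(build_clics(clusters, orthologs))
```

In this toy input, a single ortholog pair (h2, m1) is enough to pull both complete clusters into one CLIC, while the dog singleton d1 forms a species-specific CLIC of its own.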
For each multispecies CLIC, the mean cluster size was calculated; clusters whose size deviated from the mean by more than one standard deviation were then excluded, and the mean recalculated from the remaining clusters served as the estimated consensus cluster size. This was performed to eliminate the effect of species-specific expansion or deletion on the estimated ancestral cluster size.
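The consensus cluster size calculation can be sketched in a few lines. The text does not say whether a population or sample standard deviation was used; the population form is assumed here, and the function name is hypothetical.

```python
from statistics import mean, pstdev

def consensus_cluster_size(sizes):
    """Trimmed mean of per-species cluster sizes within one CLIC.

    sizes: number of genes in each species' cluster of the CLIC.
    """
    m = mean(sizes)
    sd = pstdev(sizes)  # assumption: population standard deviation
    # Exclude clusters deviating from the mean by more than one SD.
    kept = [s for s in sizes if abs(s - m) <= sd]
    return mean(kept) if kept else m

# Made-up example: one species with a strongly expanded cluster.
print(consensus_cluster_size([4, 5, 5, 20]))
```

In this made-up example the expanded cluster of 20 genes lies more than one standard deviation above the mean of 8.5, so the estimate is taken from the remaining three clusters.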

Additional data files
The following additional data are available with the online version of this paper. Additional data file 1 is a table listing the complete collection of the 251 CLICs defined in the study, each with the number of genes from each species, and the corresponding cluster names. Additional data file 2 is a table of genomic coordinates of non-OR genes shown in Figure 3. Additional data file 3 is a graph of the number of clusters as a function of the maximal intergenic distance parameter. Additional data file 4 is a figure demonstrating the reconstruction of two ancestral clusters. Additional data file 5 is a figure demonstrating conservation of ORs and their genomic regions in the chicken genome. Additional data file 6 is a graph describing the length of chicken-human alignment chains containing ORs. Additional data file 7 is a graph of the distributions of mutual sequence identity of aligned OR gene pairs identified by GENETALIGN. Additional data file 8 is a table of the genomic coordinates and cluster assignment for all OR genes used in this study. Additional data file 9 contains DNA sequences in FASTA format of the opossum ORs used in this study with their genomic coordinates in the Opossum October 2004 assembly (monDom1). Additional data file 10 contains protein sequences in FASTA format of the opossum ORs used in this study with their genomic coordinates.

Figure 4. The reconstruction of an ancestral cluster. (a) Detailed comparison of gene content and order is shown for the four clusters included in CLIC #39 (see Table 3), containing genes from the OR1 and OR3 families. A schematic ancestral configuration is shown at the bottom row, with inferred minimal number of genes. Genes are presented as triangles colored by subfamily affiliation (bottom row; yellow for subfamily OR1G).
Inferred chromosomal rearrangements relative to the ancestor are specified for each species: circle = insertion, X = deletion, arrow = inversion, broken line with number of genes = tandem duplication. A complex duplication in rodent genomes, involving subfamilies OR3A, OR1P, and OR1A, is marked with a rectangle. This duplication was probably formed via several events, some of which occurred after the split between the mouse and rat lineages. The same region experienced another independent event in the dog genome, in which three genes from subfamilies 1A and 1P were inverted as one unit. Tandem duplication at one end of the rodent clusters forms a series of numerous adjacent, highly similar genes from the same subfamily (OR1J). The human and mouse orthologous clusters were studied and compared previously [27], and a complex orthology relationship among the genes was described. (b) A phylogenetic tree of CLIC #39 ORs from which the ancestral cluster gene count can be inferred. The phylogenetic tree was generated with MEGA version 3.1 [58] using the minimum evolution (ME) algorithm and Poisson correction for distance calculation. Protein sequences were aligned with ClustalX [59]. The colors of circles next to the phylogenetic branches indicate species (blue = dog, green = rat, orange = mouse, red = human). OR51D1 serves as an outgroup. CLIC, clusters in conservation; OR, olfactory receptor.