Conservation anchors in the vertebrate genome
© BioMed Central Ltd 2005
Published: 29 June 2005
Genomic segments that do not code for proteins yet show high conservation among vertebrates have recently been identified by various computational methodologies. We refer to them as ANCORs (ancestral non-coding conserved regions). The frequency of individual ANCORs within the genome, along with their (correlated) inter-species identity scores, helps in assessing the probability that they function in transcription regulation or RNA coding.
The acronyms used for conserved regions (or elements, tags, or sequences) in different publications
HCR: Highly conserved region
CNS: Conserved noncoding sequence
CNC: Conserved non coding
CNG: Conserved non genic
CST: Conserved sequence tag
MCS: Multispecies conserved sequence
UCE: Ultra conserved element
ECR: Evolutionary conserved region
CNE: Conserved noncoding element
ANCOR: Ancestral non-coding conserved region
Initially, small-scale analyses comparing human and mouse (or other species) suggested conservation outside coding regions [5, 6]. The identification of such conservation in the vicinity of specific genes (in proximal flanking regions, untranslated regions or UTRs, and introns) helped in the exploration of the corresponding regulatory regions. Somewhat broader studies suggested sequence conservation in large sets of orthologous pairs [3, 7, 8]. The advent of full genomic sequences of human and mouse allowed the first large-scale analyses not limited to gene-related regions. A comparison between human chromosome 21 and the syntenic region in mouse revealed a significant number of noncoding conserved elements, many of them far from gene-coding regions.
A second approach relies on distant vertebrate comparison and is thus an extension of species comparison. An evolutionary distance of more than 300 million years will result in two orthologs drifting to a similarity level like that of unrelated sequences (around 30%), unless selection is at work. Any human sequence that can reliably be aligned to a chicken or fish sequence therefore strongly suggests functional constraints. The chicken genome (around 300 million years divergence from human) was proposed years ago as the best candidate for identifying human ANCORs, but only recently has the full genome sequencing of this species been accomplished. The resulting interspecies comparison shows that about 2.5% of the human genome can be reliably aligned to a chicken sequence. This portion is predicted to be functional with high specificity, supported by the fact that more than half of it is among the 5% most conserved between human and rodents. However, reduced sensitivity is reflected in a low representation of known human regulatory elements (30% are conserved in chicken, as compared to 60% in mouse). This is in accordance with a previous multispecies comparison that noted the effectiveness of the chicken genome in comparative analyses but indicated its limited sensitivity for detecting functional non-coding elements.
The most distant complete vertebrate genome available for comparison with the human is that of the pufferfish Fugu rubripes. Here, the number of detectable non-coding conserved elements is dramatically reduced, but the likelihood that they are functional increases, as a result of 450 million years of divergence. The Fugu comparative study identified approximately 1,400 ANCOR0.01% segments genome-wide (typical length of 200 bp and average identity of 84%). These are also highly conserved in chicken and rodents (average identity of 96-97% with human sequences).
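As a rough consistency check on the frequency label, the genome fraction covered by the Fugu-derived set can be computed from the figures quoted above. The Python sketch below assumes a round human genome size of 3 × 10^9 bp; that value and the function name are illustrative assumptions and are not taken from the cited study.

```python
# Assumption: approximate haploid human genome size, used as a round figure.
HUMAN_GENOME_BP = 3.0e9

def genome_fraction_percent(n_segments, mean_length_bp, genome_bp=HUMAN_GENOME_BP):
    """Return the percentage of the genome covered by a set of segments."""
    return 100.0 * n_segments * mean_length_bp / genome_bp

# Figures quoted in the text: about 1,400 segments of typical length 200 bp.
fugu_fraction = genome_fraction_percent(1400, 200)
print(f"Fugu-derived set covers about {fugu_fraction:.4f}% of the genome")
```

With these inputs the covered fraction comes out just under 0.01%, in line with the ANCOR0.01% label.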
A fourth property used for functional element identification is hierarchical organization into a family-like structure within a reference species. A paper utilizing this approach has demonstrated that while the vast majority of the top 5% of conserved elements between human and rodents are unique (singletons) in the human genome, a small number (4%) of these elements form intra-human paralogous clusters containing from two to around 1,000 members. The implication is that belonging to such a paralogous group enhances the probability of function. Statistically, these elements have a frequency of 0.1% in the genome (ANCOR0.1%), but the independent parameter of paralogy adds a new dimension to the functional pursuit. It should be stressed, though, that the resulting subset is not necessarily the most conserved 0.1%.
The ANCORs discovered by the methods described above can be examined for potential function on the basis of an array of attributes, such as overlap with expressed sequence tags (ESTs), inferred transcribed RNA structure, and location in the vicinity of exons [13, 21]. Some studies explore conservation-independent parameters, such as the potential for being nuclear matrix/scaffold attachment regions, which have subsequently been shown to be correlated with inter-species conservation. Sometimes, a combination of interspecies comparisons and conservation-independent criteria is used, as exemplified by a study that offers an improved definition of transcription factor binding sites. Given that, in general, not all functional elements are highly conserved, and vice versa, direct prediction of functional properties serves as a powerful complement to the comparative methods described.
The resulting sets of ANCORs obtained by the five methods are partially overlapping, as may be expected (Figure 3). Moreover, in some cases overlap may be limited to a shared subset of ANCORs identified by the different methods. Thus, assessing the exact relationships among the sets requires careful scrutiny.
Where are ANCORs located?
ANCORs are dispersed throughout the genome. They are located in 'gene territories': transcribed 3' UTRs, 5' UTRs or introns, as well as gene-proximal upstream and downstream non-transcribed regions. In the latter case they are more likely to serve as cis-regulatory elements. But they are also found outside such territories, in regions remote from any genes. In general, interspecies conservation decreases with increasing distance from coding exons [8, 25], implying that gene territories should be enriched in ANCORs. Indeed, a significant ANCOR5% enrichment has been reported for introns as compared to intergenic regions in the human CFTR region (encoding the cystic fibrosis transmembrane conductance regulator). In contrast, whole-genome perspectives have identified a negative correlation between the number of ANCORs and the number of coding sequences within genomic intervals [18, 26, 27]. This is also corroborated by the observation that one third of the rare ANCOR0.002% elements are located in 'gene deserts', more than 100 kb away from any gene.
Another feature of nonrandom genomic distribution is a tendency of ANCORs to appear in clusters [18, 20]. In parallel, ANCORs are reported to be enriched in gene deserts whose flanking genes are associated with transcription regulation, DNA binding, or development [14, 18, 20, 28]. The latter result points to a likelihood that ANCORs serve as distal cis-regulatory elements, potentially involved specifically in vertebrate development [14, 20].
ANCOR functional validation
Because of the conjectural aspects of ANCOR functionality, experimental evidence is extremely important for their validation. It is of course inherently impossible to prove that an ANCOR is non-functional, given the vast spectrum of potential ensuing phenotypes. One of the most obvious proposed ANCOR functions is transcription regulation. Accordingly, one of the earliest relevant studies demonstrated that approximately the top 20% of mouse-human conserved segments (ANCOR20%) contain a statistically significant twofold excess of experimentally verified upstream transcription factor binding sites. Similarly, the set of ANCOR5% in the CFTR region overlaps with 63% of the functionally validated regulatory elements.
Corroboration for this notion is found in numerous functional assessments of ANCORs revealed by human-fish comparison (see [28, 31] for reviews). In one example, two gene deserts flanking the human dachshund homolog 1 (DACH1) gene were subjected to amphibian and fish comparisons. This appears to be a rather atypical region in terms of ANCOR content (Figure 4), having a strongly elevated incidence of highly conserved segments. Of nine conserved elements identified, seven displayed in vivo enhancer activity in transgenic mice. Similarly, when ANCOR0.01% segments were identified by human-Fugu whole-genome comparison, a functionality rate of 23 out of 25 ANCORs (> 90%) was observed by an enhancer assay, based on a transient co-injection of each element with a promoter-reporter gene construct. The general conclusion is that only the top few hundred ANCORs (at incidence levels of < 0.01%) have a high probability of being functional. Alternatively, it is also possible that the function of this fraction of ANCORs is more obvious and can be tested using conventional experimental protocols, but the function of the remainder is more subtle.
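Because both assays involve small numbers of tested elements, the observed rates carry considerable uncertainty. The following Python sketch computes a Wilson score interval for the two success counts quoted above; the choice of interval and the 95% level are illustrative assumptions, not part of the cited studies.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials
                         + z * z / (4.0 * trials * trials)) / denom
    return center - half, center + half

# Counts quoted in the text; the labels are descriptive only.
for label, k, n in [("DACH1 gene deserts", 7, 9), ("human-Fugu set", 23, 25)]:
    low, high = wilson_interval(k, n)
    print(f"{label}: {k}/{n} = {k / n:.0%}, interval {low:.2f} to {high:.2f}")
```

The intervals are wide for both experiments, so the reported rates are best read as indicating a high, but only loosely bounded, probability of enhancer activity.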
We propose a parsimony-based conjecture, namely that functional non-coding segments (Figure 5a) manifest a sequence-similarity distribution similar to that of coding exons (Figure 4). This is based on the observation that the number of ultraconserved segments is comparable in coding and non-coding regions, and on the notion that selective constraints are not expected to be vastly different for the two types of functional segments. In both, different elements are expected to be under varied stringencies of selection, yielding a normal-like distribution. It may be computed that nonfunctional blocks of 100 bp with total identity (100%) are too rare to appear even once in the entire mammalian genome when neutral DNA is concerned, while a few dozen such elements are expected within the selected fraction (Figure 5b). Importantly, this very crude model predicts an appreciable number of instances of perfect identity, without assuming a distinguished population of hyper-selected or hypo-mutable DNA elements. Nevertheless, in reality there is an excess of perfect identity regions (Figure 4), suggesting a further contribution of selective pressure.
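The rarity of perfectly identical neutral blocks can be illustrated with a deliberately simpler calculation than the distribution-based model of Figure 5b. The Python sketch below assumes that neutral sites match independently with a fixed per-site identity, and that the genome offers about 3 × 10^9 window positions; the independence assumption, the genome size, and the identity values tried are all illustrative and are not taken from the original analysis.

```python
GENOME_BP = 3.0e9  # assumption: approximate number of window start positions
BLOCK_BP = 100     # block length discussed in the text

def expected_perfect_blocks(per_site_identity, block_bp=BLOCK_BP, genome_bp=GENOME_BP):
    """Expected number of fully identical windows if sites match independently."""
    return genome_bp * per_site_identity ** block_bp

# Hypothetical neutral per-site identities, chosen only to show the trend.
for identity in (0.60, 0.70, 0.80):
    count = expected_perfect_blocks(identity)
    print(f"per-site identity {identity:.2f}: expected blocks = {count:.2e}")
```

Under these assumptions the expected count stays below one even at a per-site identity of 0.80, which is consistent with the statement that such blocks should not arise by chance in neutral DNA.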
According to this model, and as corroborated by assertions in the literature, mere knowledge of interspecies sequence identity is a rather weak predictor of functional importance. For example, according to the computed curves shown in Figure 5b, a sequence identity level of around 80% is associated with an equal probability of being functional or nonfunctional. On the other hand, it is expected that sequence identity criteria will continue to be a key method for identifying functional noncoding DNA. Thus, focusing on ultraconserved segments (ANCORs with identity scores near 100% and/or frequency of < 0.01%) will be instrumental, as their status more clearly implies an association with function.
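The equal-probability point mentioned above can be made concrete with a two-component mixture. The Python sketch below treats the identity scores of functional and nonfunctional segments as two normal distributions and applies Bayes' rule; every parameter (the two means, the shared standard deviation, and the 5% functional fraction) is a hypothetical value chosen only so that the crossover lands near 80%, and none is fitted to the data behind Figure 5b.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def posterior_functional(identity, w_func=0.05, mu_func=85.0, mu_neut=65.0, sd=6.0):
    """P(functional | identity) for a two-component mixture (hypothetical parameters)."""
    func = w_func * normal_pdf(identity, mu_func, sd)
    neut = (1.0 - w_func) * normal_pdf(identity, mu_neut, sd)
    return func / (func + neut)

def crossover_identity(low=50.0, high=100.0, tol=1e-6):
    """Bisection for the identity at which the posterior equals 0.5.

    Valid here because the posterior increases with identity when both
    components share one standard deviation and mu_func > mu_neut.
    """
    while high - low > tol:
        mid = 0.5 * (low + high)
        if posterior_functional(mid) < 0.5:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)

print(f"posterior reaches 0.5 at about {crossover_identity():.1f}% identity")
```

With these placeholder values the posterior crosses 0.5 at an identity of about 80.3%. Different assumed parameters would move this point, which illustrates why identity alone is a weak predictor in the mid-range and becomes informative mainly near the extremes.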
The definition of a gene is far from straightforward. It is widely accepted that genomic segments that are transcribed into functional RNAs but do not code for proteins may be regarded as genes. This includes genes for, among others, microRNAs that fulfill central roles in gene regulatory networks [34, 35]. Many ANCORs may belong to existing categories of RNA-coding genes, or may be related to gene-proximal control elements that can safely be defined as parts of existing protein-coding genes. But the broader conservation picture that emerges, as described in this review, suggests the existence of highly conserved segments far away from other genes. Some of these have already been submitted to the EMBL database with gene-like annotations. Future scrutiny will help decide whether these genomic objects may be legitimately regarded as new classes of bona fide genes.
Additional data files
The following additional data files are available with the online version of this article: Additional data file 1 listing reported sets of noncoding conserved elements, and calculation of their frequency values; Additional data file 2 detailing the statistical properties of similarity distributions used to produce Figure 1; Additional data file 3 providing the raw data of percentage identity versus frequency as presented in Figure 1; Additional data file 4 giving the genomic coordinates of the DNA segments analyzed in Figure 4; and Additional data file 5 detailing the statistical properties of the similarity distributions presented in Figure 4.
D.L. holds the Ralph and Lois Chair in Human Genetics. This research was supported by the Crown Human Genome Center, and by an Israel Ministry of Science and Technology grant to the National Knowledge Center in Genomics.
1. Tagle D, Koop B, Goodman M, Slightom J, Hess D, Jones R: Embryonic epsilon and gamma globin genes of a prosimian primate (Galago crassicaudatus). Nucleotide and amino acid sequences, developmental regulation and phylogenetic footprints. J Mol Biol. 1988, 203: 439-455. 10.1016/0022-2836(88)90011-3.
2. Gumucio DL, Heilstedt-Williamson H, Gray TA, Tarle SA, Shelton DA, Tagle DA, Slightom JL, Goodman M, Collins FS: Phylogenetic footprinting reveals a nuclear protein which binds to silencer sequences in the human gamma and epsilon globin genes. Mol Cell Biol. 1992, 12: 4919-4929.
3. Duret L, Bucher P: Searching for regulatory elements in human noncoding sequences. Curr Opin Struct Biol. 1997, 7: 399-406. 10.1016/S0959-440X(97)80058-9.
4. Dubchak I, Brudno M, Loots GG, Pachter L, Mayor C, Rubin EM, Frazer KA: Active conservation of noncoding sequences revealed by three-way species comparisons. Genome Res. 2000, 10: 1304-1306. 10.1101/gr.142200.
5. Hardison R, Miller W: Use of long sequence alignments to study the evolution and regulation of mammalian globin gene clusters. Mol Biol Evol. 1993, 10: 73-102.
6. Koop BF, Hood L: Striking sequence similarity over almost 100 kilobases of human and mouse T-cell receptor DNA. Nat Genet. 1994, 7: 48-53. 10.1038/ng0594-48.
7. Jareborg N, Birney E, Durbin R: Comparative analysis of non-coding regions of 77 orthologous mouse and human gene pairs. Genome Res. 1999, 9: 815-824. 10.1101/gr.9.9.815.
8. Shabalina SA, Ogurtsov AY, Kondrashov VA, Kondrashov AS: Selective constraint in intergenic regions of human and mouse genomes. Trends Genet. 2001, 17: 373-376. 10.1016/S0168-9525(01)02344-7.
9. International Human Genome Sequencing Consortium: Initial sequencing and analysis of the human genome. Nature. 2001, 409: 860-921. 10.1038/35057062.
10. Mouse Genome Sequencing Consortium: Initial sequencing and comparative analysis of the mouse genome. Nature. 2002, 420: 520-562. 10.1038/nature01262.
11. Dermitzakis ET, Reymond A, Lyle R, Scamuffa N, Ucla C, Deutsch S, Stevenson BJ, Flegel V, Bucher P, Jongeneel CV, et al: Numerous potentially functional but non-genic conserved sequences on human chromosome 21. Nature. 2002, 420: 578-582. 10.1038/nature01251.
12. Dermitzakis ET, Kirkness E, Schwarz S, Birney E, Reymond A, Antonarakis SE: Comparison of human chromosome 21 conserved nongenic sequences (CNGs) with the mouse and dog genomes shows that their selective constraint is independent of their genic environment. Genome Res. 2004, 14: 852-859. 10.1101/gr.1934904.
13. Thomas JW, Touchman JW, Blakesley RW, Bouffard GG, Beckstrom-Sternberg SM, Margulies EH, Blanchette M, Siepel AC, Thomas PJ, McDowell JC, et al: Comparative analyses of multi-species sequences from targeted genomic regions. Nature. 2003, 424: 788-793. 10.1038/nature01858.
14. Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, Mattick JS, Haussler D: Ultraconserved elements in the human genome. Science. 2004, 304: 1321-1325. 10.1126/science.1098119.
15. Rat Genome Sequencing Project Consortium: Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature. 2004, 428: 493-521. 10.1038/nature02426.
16. Frazer KA, Tao H, Osoegawa K, de Jong PJ, Chen X, Doherty MF, Cox DR: Noncoding sequences conserved in a limited number of mammals in the SIM2 interval are frequently functional. Genome Res. 2004, 14: 367-372. 10.1101/gr.1961204.
17. Margulies EH, Blanchette M, Haussler D, Green ED: Identification and characterization of multi-species conserved sequences. Genome Res. 2003, 13: 2507-2518. 10.1101/gr.1602203.
18. International Chicken Genome Sequencing Consortium: Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature. 2004, 432: 695-716. 10.1038/nature03154.
19. Aparicio S, Chapman J, Stupka E, Putnam N, Chia J-M, Dehal P, Christoffels A, Rash S, Hoon S, Smit A, et al: Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science. 2002, 297: 1301-1310. 10.1126/science.1072104.
20. Woolfe A, Goodson M, Goode DK, Snell P, McEwen GK, Vavouri T, Smith SF, North P, Callaway H, Kelly K, et al: Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biol. 2005, 3: e7. 10.1371/journal.pbio.0030007.
21. Bejerano G, Haussler D, Blanchette M: Into the heart of darkness: large-scale clustering of human non-coding DNA. Bioinformatics. 2004, 20 (Suppl 1): i40-i48. 10.1093/bioinformatics/bth946.
22. Liebich I, Bode J, Frisch M, Wingender E: S/MARt DB: a database on scaffold/matrix attached regions. Nucleic Acids Res. 2002, 30: 372-374. 10.1093/nar/30.1.372.
23. Glazko GV, Koonin EV, Rogozin IB, Shabalina SA: A significant fraction of conserved noncoding DNA in human and mouse consists of predicted matrix attachment regions. Trends Genet. 2003, 19: 119-124. 10.1016/S0168-9525(03)00016-7.
24. Lenhard B, Sandelin A, Mendoza L, Engstrom P, Jareborg N, Wasserman W: Identification of conserved regulatory elements by comparative genome analysis. J Biol. 2003, 2: 13. 10.1186/1475-4924-2-13.
25. Keightley PD, Gaffney DJ: Functional constraints and frequency of deleterious mutations in noncoding DNA of rodents. Proc Natl Acad Sci USA. 2003, 100: 13402-13406. 10.1073/pnas.2233252100.
26. Dermitzakis ET, Reymond A, Scamuffa N, Ucla C, Kirkness E, Rossier C, Antonarakis SE: Evolutionary discrimination of mammalian conserved non-genic sequences (CNGs). Science. 2003, 302: 1033-1035. 10.1126/science.1087047.
27. Gaffney DJ, Keightley PD: Unexpected conserved non-coding DNA blocks in mammals. Trends Genet. 2004, 20: 332-337. 10.1016/j.tig.2004.06.011.
28. Boffelli D, Nobrega MA, Rubin EM: Comparative genomics at the vertebrate extremes. Nat Rev Genet. 2004, 5: 456-465. 10.1038/nrg1350.
29. Levy S, Hannenhalli S, Workman C: Enrichment of regulatory signals in conserved non-coding genomic sequence. Bioinformatics. 2001, 17: 871-877. 10.1093/bioinformatics/17.10.871.
30. Nobrega MA, Zhu Y, Plajzer-Frick I, Afzal V, Rubin EM: Megabase deletions of gene deserts result in viable mice. Nature. 2004, 431: 988-993. 10.1038/nature03022.
31. Elgar G: Identification and analysis of cis-regulatory elements in development using comparative genomics with the pufferfish, Fugu rubripes. Semin Cell Dev Biol. 2004, 15: 715-719. 10.1016/j.semcdb.2004.10.001.
32. Nobrega MA, Ovcharenko I, Afzal V, Rubin EM: Scanning human gene deserts for long-range enhancers. Science. 2003, 302: 413. 10.1126/science.1088328.
33. Mattick J: Challenging the dogma: the hidden layer of non-protein-coding RNAs in complex organisms. BioEssays. 2003, 25: 930-939. 10.1002/bies.10332.
34. Bartel DP: MicroRNAs: genomics, biogenesis, mechanism, and function. Cell. 2004, 116: 281-297. 10.1016/S0092-8674(04)00045-5.
35. He L, Hannon GJ: MicroRNAs: small RNAs with a big role in gene regulation. Nat Rev Genet. 2004, 5: 631. 10.1038/nrg1415.
36. UCSC Genome Browser. [http://genome.ucsc.edu/]
37. Couronne O, Poliakov A, Bray N, Ishkhanov T, Ryaboy D, Rubin E, Pachter L, Dubchak I: Strategies and tools for whole-genome alignments. Genome Res. 2003, 13: 73-80. 10.1101/gr.762503.
38. VISTA Genome Browser. [http://pipeline.lbl.gov/]
39. Chiaromonte F, Weber RJ, Roskin KM, Diekhans M, Kent WJ, Haussler D: The share of human genomic DNA under selection estimated from human-mouse genomic alignments. Cold Spring Harb Symp Quant Biol. 2003, 68: 245-254. 10.1101/sqb.2003.68.245.
40. Mignone F, Grillo G, Liuni S, Pesole G: Computational identification of protein coding potential of conserved sequence tags through cross-species evolutionary analysis. Nucleic Acids Res. 2003, 31: 4639-4645. 10.1093/nar/gkg483.
41. Ovcharenko I, Nobrega MA, Loots GG, Stubbs L: ECR Browser: a tool for visualizing and accessing data from comparisons of multiple vertebrate genomes. Nucleic Acids Res. 2004, 32 (Web server issue): W280-W286.