Fast and systematic genome-wide discovery of conserved regulatory elements using a non-alignment based approach

Elemento, Olivier; Tavazoie, Saeed

doi:10.1186/gb-2005-6-2-r18

Method
Open access
Published: 26 January 2005

Genome Biology volume 6, Article number: R18 (2005)

Abstract

We describe a powerful new approach for discovering globally conserved regulatory elements between two genomes. The method is fast, simple and comprehensive, without requiring alignments. Its application to pairs of yeasts, worms, flies and mammals yields a large number of known and novel putative regulatory elements. Many of these are validated by independent biological observations, have spatial and/or orientation biases, are co-conserved with other elements and show surprising conservation across large phylogenetic distances.

Background

One of the major challenges facing biology is to reconstruct the entire network of protein-DNA interactions within living cells. A large fraction of protein-DNA interactions corresponds to transcriptional regulators binding DNA in the neighborhood of protein-coding and RNA genes. By interacting with RNA polymerase or recruiting chromatin-modifying machinery, transcriptional regulators increase or decrease the transcription rate of these genes. Transcriptional regulators bind specific DNA sequences upstream, within or downstream of the genes they regulate, and a large number of experimental and computational studies are aimed at locating these sites and understanding their functions (for example [1, 2]). The increasing availability of whole-genome sequences provides unprecedented opportunities for identifying binding sites and studying their evolution. The strong conservation of functional elements (binding sites, protein-coding genes, noncoding RNAs, and so on) across even distantly related species should make it possible to predict these functional elements and prioritize them for experimental validation. The few large-scale comparative genomics approaches for finding transcriptional regulatory elements have so far relied mostly on detecting locally conserved motifs within global alignments of orthologous upstream sequences [3, 4]. Although very powerful and straightforward, these approaches cannot be used when upstream regions are very divergent or have undergone genomic rearrangements. For example, aligning the mouse and puffer fish orthologous upstream regions would be very difficult, because of the great reduction that the puffer fish intergenic regions have undergone [5]. Also, global alignments cannot be used when the positions of regulatory elements within functionally conserved promoter regions have been scrambled, for example through genomic rearrangements. 
In addition, global alignment-based approaches often generate an overwhelming number of predictions because of the basal conservation between the genomes under study. To reduce the number of predictions, multiple global alignments of upstream sequences from several related species have been used, yielding many new candidate binding sites [3, 4]. However, more than two closely related genome sequences are not always available; moreover, by focusing only on regulatory elements that are conserved across several genomes, these approaches might miss elements that are conserved only in more local areas of the phylogenetic tree.

Here we describe a simple and efficient comparative approach for finding short noncoding DNA sequences that are globally conserved between two genomes, independently of their specific location within their respective promoter regions. Our method, which we call FastCompare, is based on a principle that we have termed 'network-level conservation' [6], according to which the wiring of transcriptional regulatory networks should be largely conserved between two closely related genomes.

Our previous attempts at using network-level conservation relied on Gibbs sampling to find candidate regulatory elements [7]. However, Gibbs sampling and related algorithms are not fully appropriate in this context, because of the low density of actual binding sites in pairs of orthologous upstream regions. Moreover, these algorithms are non-deterministic, relatively slow, and rely on sequence sampling, which makes them likely to miss many regulatory elements. While our previous approach was successful at predicting a large fraction of functional regulatory elements in the relatively small yeast genome, analyzing larger and more complex metazoan genomes requires faster and more exhaustive algorithms. Here, we use a faster, simpler and more comprehensive approach for detecting conserved and probably functional regulatory elements using the network-level conservation principle. FastCompare allows comprehensive exploration of the motifs that are conserved (but not aligned) between two genomes, while retaining a linear time complexity. We apply our approach to a large number of species, including yeasts, worms, flies and mammals, and describe some of the most conserved known and unknown regulatory elements within these genomes. We also show how this approach may help reconstruct part of the transcriptional network and reveal some of its associated constraints. Finally, we show that a large number of predicted motifs are conserved within and across different phylogenetic groups.

Results

In the following sections, pairs of closely related species are termed phylogenetic groups. We applied FastCompare to the following four phylogenetic groups: yeasts (Saccharomyces cerevisiae and S. bayanus), worms (Caenorhabditis elegans and C. briggsae), flies (Drosophila melanogaster and D. pseudoobscura) and mammals (Homo sapiens and Mus musculus). For each phylogenetic group, we describe some of the most interesting predicted regulatory elements, both known and novel. For each of these regulatory elements, we perform independent validation using gene expression data, chromatin immunoprecipitation (IP) data, known motifs and data from several biological databases (Gene Ontology (GO)/MIPS, TRANSFAC), and show that the most globally conserved predicted regulatory elements are strongly supported by these independent sources.

Yeasts

The average nucleotide identity between S. cerevisiae and S. bayanus upstream regions is approximately 62% [4] (similar to the identity between human and mouse upstream regions) and the divergence time is estimated at between 5 and 20 million years [4]. The number of ortholog pairs between S. cerevisiae and S. bayanus is 4,358 (see Materials and methods). We chose to analyze 1 kb-long upstream regions, because most of the known transcription factor binding sites in S. cerevisiae are located within this range [8]. Using FastCompare, we calculated a conservation score for all possible 7-, 8- and 9-mers on the corresponding 8.6 megabase-pairs (Mbp) of sequence and sorted each list separately according to conservation score (see Figure 1; the raw sorted lists are available on our website [9]). On a typical desktop PC, this analysis took approximately 5 minutes (for example, the entire set (8,170) of 7-mers was processed in 35 seconds).

Distribution of conservation scores

As described in Materials and methods, conservation scores are calculated for all k-mers (with fixed k), and are relative measures of network-level conservation for these k-mers (the higher the conservation score, the more conserved the corresponding k-mer). We first describe the distribution of conservation scores for all 7-mers. As shown in Figure 2, the distribution of conservation scores has a very long tail and many 7-mers on the tail correspond to well known regulatory elements in S. cerevisiae (see below for a detailed description of these sites). To verify that such high conservation scores could not be obtained by chance, we generated randomized sequences as described in Materials and methods and re-ran FastCompare on these sequences. The corresponding distribution of conservation scores is shown on Figure 2 and clearly shows that the high conservation scores corresponding to known regulatory elements are extremely unlikely to arise by chance.

Validation using independent biological data

We used various independent sources of biological data to demonstrate that k-mers with the highest conservation scores are likely to be functional. For a given k-mer, we define the 'conserved set' as the set of ORFs corresponding to the overlap between the two sets of orthologous ORFs containing at least one exact match to the k-mer in their upstream regions (see Materials and methods). We found that conserved sets defined for the highest-scoring 7-mers are significantly enriched with genes whose upstream regions contain occurrences of known motifs in yeast (Figure 3a), significantly enriched with genes whose upstream regions were shown to be bound by known transcription factors in vivo (Figure 3b), and significantly enriched in at least one MIPS functional category (Figure 3c). We also show that the number of 7-mers found upstream of over- or underexpressed genes in at least one microarray condition increases with the conservation score (Figure 3d) and that the number of 7-mers matching at least one TRANSFAC consensus also increases with the conservation score (Figure 3e). Altogether, these data provide strong and independent evidence that our method identifies functional yeast regulatory elements by giving them a high conservation score.
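The definition of the conserved set above can be illustrated with a minimal Python sketch. The function names, the toy sequences and the choice to count matches on both strands are assumptions made for illustration only. The conservation score itself is defined in Materials and methods and is not reproduced in this section; the hypergeometric tail shown here is an assumed stand-in for measuring how surprising an overlap is, not necessarily the score used by FastCompare.

```python
from math import comb

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_with_match(kmer, upstream):
    """ORF ids whose upstream region contains an exact match to the k-mer.

    Assumption: a match on either strand counts.
    """
    rc = reverse_complement(kmer)
    return {orf for orf, seq in upstream.items() if kmer in seq or rc in seq}


def conserved_set(kmer, upstream_a, upstream_b, orthologs):
    """ORFs of genome A that carry the k-mer and whose ortholog in B does too.

    `orthologs` maps an ORF id in genome A to its ortholog id in genome B.
    """
    hits_a = orfs_with_match(kmer, upstream_a)
    hits_b = orfs_with_match(kmer, upstream_b)
    return {a for a in hits_a if orthologs.get(a) in hits_b}


def hypergeom_tail(overlap, n_a, n_b, total):
    """P(X >= overlap) when n_b of `total` ortholog pairs are drawn at random
    and n_a of them are marked (assumed null model, see lead-in)."""
    upper = min(n_a, n_b)
    favourable = sum(
        comb(n_a, i) * comb(total - n_a, n_b - i) for i in range(overlap, upper + 1)
    )
    return favourable / comb(total, n_b)


# Toy data (hypothetical sequences, three ortholog pairs).
upstream_a = {"a1": "TTCACGTGATT", "a2": "GGGGGGGG", "a3": "AATCACGTGCC"}
upstream_b = {"b1": "CCCACGTGAGG", "b2": "CACGTGA", "b3": "TTTTTTTT"}
orthologs = {"a1": "b1", "a2": "b2", "a3": "b3"}

print(conserved_set("CACGTGA", upstream_a, upstream_b, orthologs))  # {'a1'}
```

In this toy example the k-mer occurs upstream of a1 and a3 in the first genome and upstream of b1 and b2 in the second, so only the pair a1/b1 contributes to the conserved set.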

Closer examination of Figure 3a-d shows that the 400 highest-scoring 7-mers are most strongly supported by independent data. Therefore we retain them for further analysis and, when possible, replace them by 8-mers and 9-mers with higher conservation scores and also add the high-scoring 8-mers and 9-mers without high-scoring substrings, as described in Materials and methods. This processing yields 398 k-mers (k = 7, 8 and 9).

Then, for each of these 398 k-mers, we determine the optimal window within the initial 1 kb which maximizes the conservation score (see Materials and methods); we then re-evaluate the functionality of each of the 398 k-mers with the independent biological information described above, using the new conserved sets. The full information for the 398 k-mers is available at [9].

Known regulatory elements

Using known transcription factor binding site motifs, genome-wide in vivo binding data, functional annotation and literature searches, we found at least 27 different known transcription factor binding sites among the 398 highest scoring k-mers. These regulatory elements, along with their support from independent biological data, are shown in Table 1. Some of the best-known binding sites are represented several times within the 398 top scoring k-mers, in the form of slightly distinct or overlapping sequences (see [9]). Note also that we use very stringent criteria for identifying known binding sites among our predictions. When we matched our predictions to the known motifs published in [4] (regular expressions), we predicted 42 out of 53 known motifs (Kellis et al. [4] predict exactly the same number of motifs, and essentially the same motifs, but using multiple alignments of four yeast genomes).

Table 1 Known regulatory elements obtained when applying FastCompare to S. cerevisiae and S. bayanus

Among the 27 different known regulatory elements returned by FastCompare, several (Swi4, Mbp1, Sum1/Ndt80, Fkh1/2) are involved in regulating the yeast cell cycle. The other known sites are also involved in fundamental biological processes in yeast: amino-acid metabolism (Cbf1, Gcn4), meiosis (Ume6), rRNA transcription (PAC and RRPE), proteolytic degradation (Rpn4), stress response (Msn2/Msn4) and general activation/repression (Rap1, Reb1). As described in Materials and methods, our approach also handles gapped motifs. Thus, the binding sites for Abf1, a chromatin reorganizing transcription factor (CGTNNNNNNTGA), and Mcm1, a factor involved in cell-cycle regulation and pheromone response (CCCNNNNNGGA), were also identified as very high-scoring patterns and strongly supported by independent information (known motifs and chromatin immunoprecipitation).
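As an illustration of what a gapped motif such as CGTNNNNNNTGA means in practice, the following Python sketch tests whether a sequence contains a match to a pattern in which N stands for any nucleotide. It is not the procedure described in Materials and methods; the helper names and the decision to scan both strands are assumptions.

```python
import re

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(motif):
    """Reverse complement of a motif written with A, C, G, T and N."""
    return motif.translate(COMPLEMENT)[::-1]


def motif_to_regex(motif):
    """Compile a motif into a regular expression; N matches any nucleotide."""
    return re.compile("".join("[ACGT]" if c == "N" else c for c in motif))


def has_gapped_match(motif, seq):
    """True if `seq` matches the motif on either strand (assumed convention)."""
    return any(
        motif_to_regex(m).search(seq) is not None
        for m in (motif, reverse_complement(motif))
    )


ABF1 = "CGTNNNNNNTGA"  # gapped Abf1 pattern quoted in the text
print(has_gapped_match(ABF1, "AACGTAAACCCTGATT"))  # True
print(has_gapped_match(ABF1, "AACGTAAATGATT"))     # False: gap too short
```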

When we used the same independent biological data to evaluate the 400 highest-scoring 7-mers obtained on randomized data, we found only three known binding sites (RRPE, FKH1 and BAS1).

Several known binding sites are not found among the 398 top-scoring k-mers, perhaps because their transcriptional network has undergone extensive rewiring since the speciation of the two yeasts, or because the corresponding transcription factors regulate few genes. In some cases, the presence of several known sites (clearly identified in terms of independent data) among the full set of 7-mers argues in favor of the rewiring hypothesis. For example, the binding site for the Rcs1 transcription factor, TGCACCC, only appears at the 1,883rd position within the list of ranked 7-mers. Despite its lack of conservation, this site is strongly backed by independent biological information: it is identified as a known motif, it is found in 33 microarray conditions, and its conserved set is significantly enriched in genes annotated with homeostasis of metal ions (p < 10^-5), which is the known function for Rcs1 [10]. Similarly, the known binding sites for the Ace2/Swi5 and Hsf1 transcription factors were clearly identified (in terms of independent data) within the complete list of 7-mers, but not among the 398 highest scoring k-mers.

Positional constraints

It is now known that functional regulatory elements can be positionally constrained, relative to other regulatory elements or to the start of transcription [7, 11, 12]. To assess whether some of the predicted regulatory elements are positionally constrained in yeast, we calculated the median distance to ATG for the conserved sets of each of the 398 k-mers. Independently, we built the distribution of median distances to ATG for all 7-mers as described in Materials and methods (the distribution is shown in Figure 4) and found d_0.025 = 350 bp and d_0.975 = 680 bp. In other words, a median distance to ATG of less than 350 bp, or of more than 680 bp, should each arise by chance with a probability of only 2.5%. Among the 398 most conserved k-mers, more than a fifth (86) have a median distance below 350 bp (p < 10^-52), while only seven have a median distance greater than 680 bp. A closer examination reveals that a few known sites are particularly constrained. For example, the binding sites for Reb1, PAC, TATA, Swi4, Rpn4, RRPE and Mbp1 are found to be situated relatively close to the start of translation, with a median distance to ATG between 150 and 300 bp. Some of these constraints were also found to be good predictors of gene expression in a recent study [11] (for RPN4, PAC and RRPE, for example). In contrast, binding sites for Met4, Ume6, Hap4, Rap1, Ino4 and Ste12 are found to be situated at a greater median distance, between 400 and 500 bp from ATG.
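The order of magnitude of the p-value quoted above can be checked with a short calculation. The sketch below assumes that the p-value is a one-sided binomial tail: each of the 398 k-mers is treated as an independent trial with a 2.5% chance of having a median distance below the lower quantile. The text does not state which test was used, so this is an assumption, and the function names are illustrative.

```python
from math import exp, lgamma, log


def log_binom_pmf(k, n, p):
    """Natural log of the binomial probability mass P(X = k)."""
    return (
        lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        + k * log(p) + (n - k) * log(1 - p)
    )


def binom_tail_log10(k_min, n, p):
    """log10 of P(X >= k_min), summed in log space to avoid underflow."""
    logs = [log_binom_pmf(k, n, p) for k in range(k_min, n + 1)]
    top = max(logs)
    return (top + log(sum(exp(x - top) for x in logs))) / log(10)


# 86 of the 398 k-mers fall below the 2.5% quantile; about 10 are expected.
print(binom_tail_log10(86, 398, 0.025))
```

Under this assumption the tail probability lies between 10^-53 and 10^-52, which is consistent with the bound given in the text. By contrast, seven k-mers above the upper quantile is close to the roughly ten expected by chance.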

Novel predicted regulatory elements

We found many novel motifs among our highest-scoring predictions. For example, we found two strongly conserved motifs, AGGGTAA (rank 17) and TGTAAATA (rank 31), which are situated relatively close to ATG (with a median distance to ATG of 349 and 378.5 bp, respectively) and more often in upstream regions than in coding regions (with ratios of 1.95 and 1.83, respectively). Interestingly, TGTAAATA also has a statistically significant 5' to 3' orientation bias (binomial p-value < 10^-7). However, neither of the two putative sites is supported by independent biological data. Additional expression data may help define their biological role. Other sites, such as CAGCCGC or GCGCCGC are found upstream of over- or underexpressed genes in many microarray conditions (15 and 6, respectively). While these two sites are similar to the canonical Ume6-binding site, the latter was not found in any microarray conditions (as none of the microarray experiments we used is related to meiosis, the biological process which Ume6 is known to be involved in), suggesting that the two sites are bound by other factors.

Comparing closer and more distant yeast species

We repeated the same analysis on distinct pairs of yeast species other than S. cerevisiae/S. bayanus. We first compared S. cerevisiae and S. paradoxus (a much closer relative of S. cerevisiae) and found 15 of the 27 known motifs we obtained when comparing S. cerevisiae and S. bayanus (results are available at [9]). We also compared S. cerevisiae with S. castellii, which is a more distant relative within the Saccharomyces phylogenetic group. S. castellii is interesting in that its upstream regions cannot be globally aligned with those of S. cerevisiae, because of extensive sequence divergence [3]. In this comparison we again found 15 of the 27 known motifs found in the S. cerevisiae/S. bayanus comparison (results at [9]), although they were different from the S. cerevisiae/S. paradoxus conserved motifs. Interesting similarities and differences in conservation were revealed when comparing the known motifs discovered in each comparison. For example, the PAC, RRPE and Mbp1 motifs were found within the highest-scoring k-mers in all three comparisons, hinting at the conserved role of the corresponding proteins. However, the Reb1-binding site, which was found to be highly conserved between S. cerevisiae and S. bayanus (rank 1), is much less conserved between S. cerevisiae and S. castellii (rank 230). This argues for extensive rewiring in the Reb1 transcriptional network in the lineage that led to S. castellii.

Motif interactions

To discover interactions between regulatory elements, we searched for co-conservation of pairs of high-scoring predicted regulatory elements, as described in Materials and methods. Not surprisingly, the most conserved interaction is between RRPE (AAAAATTTT) and PAC (CTCATCGC), with a median distance D = 22 bp [11, 13]. We also find that the Cbf1-binding site (CACGTGA) is strongly co-conserved with the Met4-binding site (CTGTGGC), and that these two sites are separated by a short distance (D = 44.5) in S. cerevisiae. Indeed, it has been shown that the binding of Cbf1 in the vicinity of a very similar sequence (AAACTGTG) enhances the DNA-binding affinity of a Met4-Met28-Met31 complex for this sequence [14], and that the median distance between the above Cbf1 and Met4 sites is small [15].
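The median distance D between two motifs can be illustrated with a small Python sketch. The exact procedure is given in Materials and methods and is not reproduced here, so the conventions below are assumptions: only the forward strand is scanned, the distance in one upstream region is the smallest gap between the start positions of the two motifs, and D is the median of these distances over all regions that contain both motifs. The sequences are hypothetical.

```python
from statistics import median


def positions(motif, seq):
    """Start indices of all exact (possibly overlapping) matches of motif."""
    found, i = [], seq.find(motif)
    while i != -1:
        found.append(i)
        i = seq.find(motif, i + 1)
    return found


def median_pair_distance(motif_a, motif_b, regions):
    """Median over regions of the closest start-to-start distance.

    Regions lacking either motif are ignored; returns None if no region
    contains both motifs.
    """
    distances = []
    for seq in regions.values():
        pos_a, pos_b = positions(motif_a, seq), positions(motif_b, seq)
        if pos_a and pos_b:
            distances.append(min(abs(i - j) for i in pos_a for j in pos_b))
    return median(distances) if distances else None


PAC, RRPE = "CTCATCGC", "AAAAATTTT"
regions = {
    "g1": PAC + "G" * 14 + RRPE,  # starts are 22 bp apart
    "g2": RRPE + "G" * 3 + PAC,   # starts are 12 bp apart
    "g3": "GG" + PAC + "GG",      # only one motif: ignored
}
print(median_pair_distance(PAC, RRPE, regions))  # 17.0
```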

Many of the predicted interactions have not yet been experimentally studied. For example, we found that the highest scoring Reb1 motif (CGGGTAA) is significantly co-conserved with both the highest scoring RRPE motif (AAAAATTTT) and the highest scoring PAC motif (CTCATCGC), with a short median distance between the two sites in both cases (D = 38 and D = 63.5, respectively). The Reb1/RRPE interaction was also discovered independently as a good predictor of expression [11]. We also found that Reb1 interacts with the Cbf1 motif (CACGTGA), also at a short median distance (D = 30). An interesting interaction between RRPE and an unknown motif, TGAAGAA, displays a conserved set strongly enriched in translation (p < 10^-11), while RRPE alone is more strongly enriched in rRNA transcription (p < 10^-14). The full sorted list of interactions is available at [9].

Worms

In contrast to yeast, relatively little is known about cis-regulatory sequences in C. elegans. Transcriptional regulation is considerably more complex in multicellular organisms, where transcription factors regulate cohorts of genes in different tissues and at different times during development [16]. C. elegans promoter regions often contain many domains of activation/repression and, as a result, are much larger than those in yeast.

We applied FastCompare to the genomes of C. elegans and C. briggsae, two worms that diverged about 50-120 million years ago [17]. The number of orthologous open reading frames (ORFs) between these two species is 13,046 and here we have only considered 2,000 bp upstream regions. It takes approximately 11 minutes for FastCompare to process the corresponding 50 Mbp of sequences and calculate a conservation score for all 7-, 8- and 9-mers on a typical desktop PC.

Validations

The distribution of conservation scores for all 7-mers shows that high conservation scores are unlikely to be obtained by chance, and many known regulatory elements fall on the tail of the distribution (Figure 5a). We then used functional categories, over- or underexpression, and TRANSFAC motifs to assess the ability of FastCompare to predict functional regulatory elements. Figure 5b-d shows that support for the highest-scoring k-mers by functional enrichment, expression and TRANSFAC strongly increases with conservation score. We have only retained the 400 highest-scoring 7-mers, which are particularly well supported by independent biological information as shown in Figure 5b,c. Starting from these 400 highest-scoring 7-mers, we obtain 437 k-mers (k = 7, 8 or 9) using the procedure described in Materials and methods.