- Open Access
Fast and systematic genome-wide discovery of conserved regulatory elements using a non-alignment based approach
Genome Biologyvolume 6, Article number: R18 (2005)
We describe a powerful new approach for discovering globally conserved regulatory elements between two genomes. The method is fast, simple and comprehensive, without requiring alignments. Its application to pairs of yeasts, worms, flies and mammals yields a large number of known and novel putative regulatory elements. Many of these are validated by independent biological observations, have spatial and/or orientation biases, are co-conserved with other elements and show surprising conservation across large phylogenetic distances.
One of the major challenges facing biology is to reconstruct the entire network of protein-DNA interactions within living cells. A large fraction of protein-DNA interactions corresponds to transcriptional regulators binding DNA in the neighborhood of protein-coding and RNA genes. By interacting with RNA polymerase or recruiting chromatin-modifying machinery, transcriptional regulators increase or decrease the transcription rate of these genes. Transcriptional regulators bind specific DNA sequences upstream, within or downstream of the genes they regulate, and a large number of experimental and computational studies are aimed at locating these sites and understanding their functions (for example [1, 2]). The increasing availability of whole-genome sequences provides unprecedented opportunities for identifying binding sites and studying their evolution. The strong conservation of functional elements (binding sites, protein-coding genes, noncoding RNAs, and so on) across even distantly related species should make it possible to predict these functional elements and prioritize them for experimental validation. The few large-scale comparative genomics approaches for finding transcriptional regulatory elements have so far relied mostly on detecting locally conserved motifs within global alignments of orthologous upstream sequences [3, 4]. Although very powerful and straightforward, these approaches cannot be used when upstream regions are very divergent or have undergone genomic rearrangements. For example, aligning the mouse and puffer fish orthologous upstream regions would be very difficult, because of the great reduction that the puffer fish intergenic regions have undergone . Also, global alignments cannot be used when the positions of regulatory elements within functionally conserved promoter regions have been scrambled, for example through genomic rearrangements. Also, global alignment-based approaches often generate an overwhelming number of predictions because of the basal conservation between the genomes under study. To reduce the number of predictions, multiple global alignments of upstream sequences from several related species have been used, yielding many new candidate binding sites [3, 4]. However, multiple (more than two) closely related genome sequences are not always available; moreover, by focusing only on regulatory elements that are conserved between several genomes, these approaches might miss elements that are conserved in more local areas of the phylogenetic tree.
Here we describe a simple and efficient comparative approach for finding short noncoding DNA sequences that are globally conserved between two genomes, independently of their specific location within their respective promoter regions. Our method, which we call FastCompare, is based on a principle that we have termed 'network-level conservation' , according to which the wiring of transcriptional regulatory networks should be largely conserved between two closely related genomes.
Our previous attempts at using network-level conservation relied on Gibbs sampling to find candidate regulatory elements . However, Gibbs sampling and related algorithms are not fully appropriate in this context, because of the low density of actual binding sites in pairs of orthologous upstream regions. Moreover, these algorithms are non-deterministic, relatively slow, and rely on sequence sampling, which makes them likely to miss many regulatory elements. While our previous approach was successful at predicting a large fraction of functional regulatory elements in the relatively small yeast genome, analyzing larger and more complex metazoan genomes requires faster and more exhaustive algorithms. Here, we use a faster, simpler and more comprehensive approach for detecting conserved and probably functional regulatory elements using the network-level conservation principle. FastCompare allows comprehensive exploration of the conserved - but not aligned - motifs between two genomes, while retaining a linear time complexity. We apply our approach to a large number of species, including yeasts, worms, flies and mammals, and describe some of the most conserved known and unknown regulatory elements within these genomes. We also show how this approach may help reconstruct part of the transcriptional network and reveal some of its associated constraints. Finally, we show that a large number of predicted motifs are conserved within and across different phylogenetic groups.
In the following sections, pairs of closely related species are termed phylogenetic groups. We applied FastCompare to the four following phylogenetic groups: yeasts (Saccharomyces cerevisiae and S. bayanus), worms (Caenorhabditis elegans and C. briggsae), flies (Drosophila melanogaster and D. pseudoobscura) and mammals (Homo sapiens and Mus musculus). For each phylogenetic group, we describe some of the most interesting, known and novel, predicted regulatory elements. For each of these regulatory elements, we perform independent validation using gene expression data, chromatin immunoprecipitation (IP) data, known motifs and data from several biological databases (Gene Ontology (GO)/MIPS, TRANSFAC), and show that the most globally conserved predicted regulatory elements are strongly supported by these independent sources.
The average nucleotide identity between S. cerevisiae and S. bayanus upstream regions is approximately 62%  (similar to the identity between human and mouse upstream regions) and divergence times are estimated between 5 and 20 million years . The number of ortholog pairs between S. cerevisiae and S. bayanus is 4,358 (see Materials and methods). We chose to analyze 1 kb-long upstream regions, because most of the known transcription factor binding sites in S. cerevisiae are located within this range . Using FastCompare, we calculated a conservation score for all possible 7-, 8- and 9-mers on the corresponding 8.6 megabase-pairs (Mbp) of sequences and sorted each list separately according to conservation score (see Figure 1; the raw sorted lists are available on our website ). On a typical desktop PC, this analysis took approximately 5 minutes (for example, the entire set (8,170) of 7-mers was processed in 35 seconds).
Distribution of conservation scores
As described in Materials and methods, conservation scores are calculated for all k-mers (with fixed k), and are relative measures of network-level conservation for these k-mers (the higher the conservation score, the more conserved the corresponding k-mer). We first describe the distribution of conservation scores for all 7-mers. As shown in Figure 2, the distribution of conservation scores has a very long tail and many 7-mers on the tail correspond to well known regulatory elements in S. cerevisiae (see below for a detailed description of these sites). To verify that such high conservation scores could not be obtained by chance, we generated randomized sequences as described in Materials and methods and re-ran FastCompare on these sequences. The corresponding distribution of conservation scores is shown on Figure 2 and clearly shows that the high conservation scores corresponding to known regulatory elements are extremely unlikely to arise by chance.
Validation using independent biological data
We used various independent sources of biological data to demonstrate that k-mers with the highest conservation scores are likely to be functional. For a given k-mer, we define the 'conserved set' as the set of ORFs corresponding to the overlap between the two sets of orthologous ORFs containing at least one exact match to the k-mer in their upstream regions (see Materials and methods). We found that conserved sets defined for the highest-scoring 7-mers are significantly enriched with genes whose upstream regions contain occurrences of known motifs in yeast (Figure 3a), significantly enriched with genes whose upstream regions were shown to be bound by known transcription factors in vivo (Figure 3b), and significantly enriched in at least one MIPS functional category (Figure 3c). We also show that the number of 7-mers found upstream of over- or underexpressed genes in at least one microarray condition increases with the conservation score (Figure 3d) and that the number of 7-mers matching at least one TRANSFAC consensus also increases with the conservation score (Figure 3e). Altogether, these data provide strong and independent evidence that our method identifies functional yeast regulatory elements by giving them a high conservation score.
Closer examination of Figure 3a-d shows that the 400 highest-scoring 7-mers are most strongly supported by independent data. Therefore we retain them for further analysis and, when possible, replace them by 8-mers and 9-mers with higher conservation scores and also add the high-scoring 8-mers and 9-mers without high-scoring substrings, as described in Materials and methods. This processing yields 398 k-mers (k = 7, 8 and 9).
Then, for each of these 398 k-mers, we determine the optimal window within the initial 1 kb which maximizes the conservation score (see Materials and methods); we then re-evaluate the functionality of each of the 398 k-mers with the independent biological information described above, using the new conserved sets. The full information for the 398 k-mers is available at .
Known regulatory elements
Using known transcription factor binding site motifs, genome-wide in vivo binding data, functional annotation and literature searches, we found at least 27 different known transcription factor binding sites among the 398 highest scoring k-mers. These regulatory elements, along with their support from independent biological data, are shown in Table 1. Some of the best-known binding sites are represented several times within the 398 top scoring k-mers, in the form of slightly distinct or overlapping sequences (see ). Note also that we use very stringent criteria for identifying known binding sites among our predictions. When we matched our predictions to the known motifs published in  (regular expressions), we predicted 42 out of 53 known motifs (Kellis et al.  predict exactly the same number of motifs, and essentially the same motifs, but using multiple alignments of four yeast genomes).
Among the 27 different known regulatory elements returned by FastCompare, several (Swi4, Mbp1, Sum1/Ndt80, Fkh1/2) are involved in regulating the yeast cell cycle. The other known sites are also involved in fundamental biological processes in yeast: amino-acid metabolism (Cbf1, Gcn4), meiosis (Ume6), rRNA transcription (PAC and RRPE), proteolytic degradation (Rpn4), stress response (Msn2/Msn4) and general activation/repression (Rap1, Reb1). As described in Materials and methods, our approach also handles gapped motifs. Thus, the binding sites for Abf1, a chromatin reorganizing transcription factor (CGTNNNNNNTGA), and Mcm1, a factor involved in cell-cycle regulation and pheromone response (CCCNNNNNGGA), were also identified as very high-scoring patterns and strongly supported by independent information (known motifs and chromatin immunoprecipitation).
When we used the same independent biological data to evaluate the 400 highest-scoring 7-mers obtained on randomized data, we found only three known binding sites (RRPE, FKH1 and BAS1).
Several known binding sites are not found among the 398 top-scoring k-mers, perhaps because their transcriptional network has undergone extensive rewiring since the speciation of the two yeasts, or because the corresponding transcription factors regulate few genes. In some cases, the presence of several known sites (clearly identified in terms of independent data) among the full set of 7-mers argues in favor of the rewiring hypothesis. For example, the binding site for the Rcs1 transcription factor, TGCACCC, only appears at the 1,883rd position within the list of ranked 7-mers. Despite its lack of conservation, this site is strongly backed by independent biological information: it is identified as a known motif, it is found in 33 microarray conditions, and its conserved set is significantly enriched in genes annotated with homeostasis of metal ions (p < 10-5), which is the known function for Rcs1 . Similarly, the known binding sites for the Ace2/Swi5 and Hsf1 transcription factors were clearly identified (in terms of independent data) within the complete list of 7-mers, but not among the 398 highest scoring k-mers.
It is now known that functional regulatory elements can be positionally constrained, relative to other regulatory elements or to the start of transcription [7, 11, 12]. To assess whether some of the predicted regulatory elements are positionally constrained in yeast, we calculated the median distance to ATG for the conserved sets of each of the 398 k-mers and independently built the distribution of median distances to ATG for all 7-mers as described in Materials and methods (the distribution is shown in Figure 4) and found d0.025 = 350 and d0.975 = 680. In other words, a median distance to ATG of less than 350 or higher than 680 should each arise by chance with only a 2.5% probability. Among the 398 most conserved k-mers, more than a fifth (86) have their median distance below 350 (p < 10-52), while only seven have a median distance greater than 680. A closer examination reveals that a few known sites are particularly constrained. For example, the binding sites for Reb1, PAC, TATA, Swi4, Rpn4, RRPE and Mbp1 are found to be situated relatively close to the start of translation, with a median distance to ATG between 150 and 300 bp. Some of these constraints were also found to be good predictors of gene expression in a recent study  (for RPN4, PAC and RRPE, for example). In contrast, binding sites for Met4, Ume6, Hap4, Rap1, Ino4 and Ste12 are found to be situated at a greater median distance, between 400 and 500 bp from ATG.
Novel predicted regulatory elements
We found many novel motifs among our highest-scoring predictions. For example, we found two strongly conserved motifs, AGGGTAA (rank 17) and TGTAAATA (rank 31), which are situated relatively close to ATG (with a median distance to ATG of 349 and 378.5 bp, respectively) and more often in upstream regions than in coding regions (with ratios of 1.95 and 1.83, respectively). Interestingly, TGTAAATA also has a statistically significant 5' to 3' orientation bias (binomial p-value < 10-7). However, neither of the two putative sites is supported by independent biological data. Additional expression data may help define their biological role. Other sites, such as CAGCCGC or GCGCCGC are found upstream of over- or underexpressed genes in many microarray conditions (15 and 6, respectively). While these two sites are similar to the canonical Ume6-binding site, the latter was not found in any microarray conditions (as none of the microarray experiments we used is related to meiosis, the biological process which Ume6 is known to be involved in), suggesting that the two sites are bound by other factors.
Comparing closer and more distant yeast species
We repeated the same analysis on distinct pairs of yeast species other than S. cerevisiae/S. bayanus. We first compared S. cerevisiae and S. paradoxus (a much closer relative of S. cerevisiae) and found 15 of the 27 known motifs we obtained when comparing S. cerevisiae and S. bayanus (results are available at ). We also compared S. cerevisiae with S. castellii, which is a more distant relative within the Saccharomyces phylogenetic group. S. castelli is interesting in that its upstream regions cannot be globally aligned with those of S. cerevisiae, because of extensive sequence divergence . We also found 15 of the 27 known motifs found in the S. cerevisiae/S. bayanus comparison (results at ), although they were different from the S. cerevisiae/S. paradoxus conserved motifs. Interesting similarities and differences in conservation were revealed when comparing the known motifs discovered in each comparison. For example, the PAC, RRPE and Mbp1 motifs were found within the highest-scoring k-mers in all three comparisons, hinting at the conserved role of the corresponding proteins. However, the Reb1-binding site, which was found to be highly conserved between S. cerevisiae and S. bayanus (rank 1), is much less conserved between S. cerevisiae and S. castelli (rank 230). This argues for extensive rewiring in the Reb1 transcriptional network in the lineage that led to S. castelli.
To discover interactions between regulatory elements, we searched for co-conservation of pairs of high-scoring predicted regulatory elements, as described in Materials and methods. Not surprisingly, the most conserved interaction is between RRPE (AAAAATTTT) and PAC (CTCATCGC), with a median distance D = 22 bp [11, 13]. We also find that the Cbf1-binding site (CACGTGA) is strongly co-conserved with the Met4-binding site (CTGTGGC), and that these two sites are separated by a short distance (D = 44.5) in S. cerevisiae. Indeed, it has been shown that the binding of Cbf1 in the vicinity of a very similar sequence (AAACTGTG) enhances the DNA-binding affinity of a Met4-Met28-Met31 complex for this sequence , and that the median distance between the above Cbf1 and Met4 sites is small .
Many of the predicted interactions have not yet been experimentally studied. For example, we found that the highest scoring Reb1 motif (CGGGTAA) is significantly co-conserved with both the highest scoring RRPE motif (AAAAATTTT) and the highest scoring PAC motif (CTCATCGC), with a short median distance between the two sites in both cases (D = 38 and D = 63.5, respectively). The Reb1/RRPE interaction was also discovered independently as a good predictor of expression . We also found that Reb1 interacts with the Cbf1 motif (CACGTGA), also at a short median distance (D = 30). An interesting interaction between RRPE and an unknown motif, TGAAGAA, displays a conserved set strongly enriched in translation (p < 10-11), while RRPE alone is more strongly enriched in rRNA transcription (p < 10-14). The full sorted list of interactions is available at .
In contrast to yeast, relatively little is known about cis-regulatory sequences in C. elegans. There is a dramatically greater complexity of transcriptional regulation in multicellular organisms. Indeed, transcription factors in multicellular organisms regulate cohorts of genes in different tissues and at different times during development . C. elegans promoter regions often contain many domains of activation/repression and, as a result, are much larger than those in yeast.
We applied FastCompare to the genomes of C. elegans and C. briggsae, two worms that diverged about 50-120 million years ago . The number of orthologous open reading frames (ORFs) between these two species is 13,046 and here we have only considered 2,000 bp upstream regions. It takes approximately 11 minutes for FastCompare to process the corresponding 50 Mbp of sequences and calculate a conservation score for all 7-, 8- and 9-mers on a typical desktop PC.
The distribution of conservation scores for all 7-mers shows that high conservation scores are unlikely to be obtained by chance (Figure 5a). As shown in Figure 5a, many known regulatory elements fall on the tail of the distribution. We then used functional categories, over- or underexpression, and TRANSFAC motifs to assess the ability of FastCompare to predict functional regulatory elements. Figure 5b-d shows that support for the highest-scoring k-mers by functional enrichment, expression and TRANSFAC strongly increases with conservation score. We have only retained the 400 highest-scoring 7-mers, which are particularly well supported by independent biological information as shown in Figure 5b,c. Starting from these 400 highest-scoring 7-mers, we obtain 437 k-mers (k = 7, 8 or 9) using the procedure described in Materials and methods.
Known regulatory elements
As shown in Table 2, at least 15 distinct known binding sites in C. elegans and other metazoan organisms were identified among the 437 predicted regulatory elements.
One of the most conserved is TGATAAG, the binding site for the GATA factors, a family of regulators controlling intestinal development (see  for review). Another motif returned by FastCompare, GTGTTTGC, corresponds to the binding site for the forkhead-related activator-4 (Freac-4) . Note that this motif is also compatible with the PHA-4-binding site (published consensus: T[AG]TT[GT][AG][CT] ), present in the upstream regions of pharyngeal genes  (PHA-4 is also a member of the forkhead family of transcription factors). FastCompare also returned TGTCATCA, the known binding site for the SKN-1 transcription factor (published consensus [AT][AT]T[AG]TCAT). In C. elegans, SKN-1 is known to initiate mesendodermal development by inducing expression of the GATA factors MED-1 and MED-2 (required for mesendodermal differentiation in the EMS lineage) .
The GAGA-factor binding site (AGAGAGA) was also found as a highly conserved pattern. GAGA repeats in upstream regions have been shown to be functional in C. elegans in at least two separate studies [22, 23]. At least one GAGA-binding protein has been identified in D. melanogaster, and is assumed to create nucleosome-free regions of DNA, thus allowing additional transcription factors to bind those regions . However, the ortholog of this protein has not yet been identified in C. elegans .
We also found CAGCTGG, a site known to be bound by the myogenic basic helix-loop-helix (bHLH) family of transcription factors (in worms, flies and mammals) and AP-4 transcription factors (in mammals) [25, 26] (published consensus CAGCTG [27–29]). The homolog of human AP-4 was found to be ubiquitously expressed in D. melanogaster and a C. elegans homolog has also been identified . FastCompare returned GTAAACA, the known binding site for the DAF-16 transcription factor (published consensus GTAAACA [30, 31]). DAF-16, a FOXO-family transcription factor, was shown to influence the rate of aging of C. elegans in response to insulin/insulin-like growth factor-1 signaling [31, 32].
Searching for gapped motifs found few strongly conserved sites. However, when searching for 8-mers with a 5-bp gap, we found that TGGCNNNNNGCCA, the known binding site for nuclear factor I (NFI) , had a score comparable to those of the highest-scoring k-mers.
Several of the C. elegans sites returned by FastCompare and shown in Table 2 are known to be functional transcription factor binding sites in other species. For example, TGACTCAT, identical to the AP-1-binding site , is known to be bound in yeast (by Gcn4), Drosophila , mouse and human (see  for a review).
FastCompare also returns the CACGTGG motif, which is the binding site for the Myc/Max complex, a family of bHLH transcription factors . Among the top-scoring motifs in Table 2, we also find AAGGTCA, the hormone response element (HRE), bound by several transcription factors in human, mouse, fruit fly and silkworm (published consensus [CT]CAAGG[CT]C[AG] [38, 39]); TGACGTC, the cAMP response element (published consensus TGACGTCA ); CCCGCCC, the binding site for the mammalian Sp1 transcription factor (known consensus CCCCGCCCC); ATCAATCA, the known binding site for the human proto-oncogene Pbx-1 . A similar site, ATCAATTA, has been shown to be bound in vitro by the Drosophila homolog of Pbx-1, the extradenticle (exd) protein . Moreover, CEH-20C was identified as the C. elegans homolog of both Pbx-1 and exd. Other known sites discovered by FastCompare include CAGGTGA, similar to the known binding site for the Snail protein, a transcription factor involved in dorso-ventral pattern formation in Drosophila (published consensus [AG][AT][AG]ACAGGTG[CT]AC ), and TTCGCGC, the known binding site for the E2F proteins, a family of transcription factors involved in regulating the cell cycle in Drosophila and mammals (published consensus TTTCGCGC ). An E2F homolog has been identified in C. elegans and recently shown to be involved in cell-cycle regulation [45, 46].
Position and orientation biases
As in yeast, several of the known binding sites in C. elegans appear to be constrained in terms of position. Using the distribution of median distances for all 7-mers (see Materials and methods), we found d0.025 = 690 and d0.975 = 1,135. Among the 437 highest-scoring k-mers, we found that 75 are located below the lower threshold, a proportion that is much higher than the expected 2.5% (p < 10-38). The binding sites for forkhead-related activator-4 (Freac-4), Sp1, E2F and AP-1 are particularly constrained (see Figure 6). We found only 21 k-mers to be located further away from the distant d0.975 threshold. Interestingly, the most conserved k-mer among these 21, CCACCAGGA (rank 96), is found in the upstream regions of over- or underexpressed genes in 57 microarray conditions.
Note that for a few predicted elements (for example, CAGGTGA, rank 111), the median distance falls outside of the optimal window; this is due to the fact that, for these elements, the median distance does not correspond to the peak of the distribution of distances to ATG. Hence, for these elements, the optimal window provides a better descriptor of the positional bias than the median distance. Additional analysis reveals that several of the known binding sites discovered in this study are constrained in term of orientation. For example, the binding site for the GATA-factor(s) (as shown in Table 2) is significantly more often found in the 3' to 5' orientation, relative to downstream genes. Probably the most interesting finding is that the GAGA repeats appear to be strongly oriented 3' to 5' relative to their downstream genes. Indeed, 2,375 out of 3,557 (67%) of the AGAGAGA sites are oriented 3' to 5', a proportion that is much larger than the expected 50% (p < 10-90). This bias is confirmed by the fact that TCTCTCT alone (not taking into account its reverse complement) has a much higher conservation score (129.2) than AGAGAGA (34.3). We also found that several related motifs display a similar, albeit weaker, orientation bias, for example, GAAGAAG (p < 10-16), GGAGGAG (p < 10-10). It is interesting that all the GAGA repeats found to be necessary for correct expression of the ceh-24 and unc-54 genes are in fact TCTC repeats [22, 23]. The conserved sets for TCTCTCT or AGAGAGA were not found to be enriched in any GO category. Note that this orientation bias is not due to genes with the repeats in their upstream regions being predominantly located on one strand, as these genes are approximately identically distributed on each strand (1,065/1,122, p = 0.89). Interestingly, conserved GAGA repeats in D. melanogaster were also found to be constrained in terms of orientation, but at a much lower significance (p < 10-4, see below). Although it is possible that the TCTC repeats are bound at the 5' untranslated region (UTR) mRNA level, the positional distribution of the conserved AGAGAGA sites does not indicate a strong positional bias with respect to ATG (DATG = 893).
Novel predicted regulatory elements
FastCompare also returned many novel motifs; some of the most interesting ones are shown in Table 3. The top-scoring motif, CTGCGTCT, belongs to this category. A larger version of that motif, TCTGCGTCTCT, was found in a recent study to be necessary for the expression of several ethanol-response genes . However, the very high conservation of this site suggests a broader role. It is interesting to note that this site was not significantly found upstream of under- or overexpressed genes in any microarray conditions (including the data from ). Interestingly, the most conserved k-mer found in yeast, the binding site for the Reb1 protein, had the same property. Moreover, this site displays a relatively strong orientation bias 5' to 3' (p < 10-10).
Several of the other novel predicted regulatory elements in Table 3 have interesting properties. For example, the fourth most-conserved k-mer, CGACACTCC, is one of the closest motifs to ATG, with a median distance of 234 bp, and its conserved set is strongly enriched in genes involved in positive regulation of growth (a biological process defined in GO as the increase in size or mass of all or part of the worm) (p < 10-7). Another predicted regulatory element, CGAGACC (rank 20), is found upstream of downregulated genes in 23 microarray conditions. Interestingly, it is found upstream of downregulated genes in a study measuring gene-expression changes at several time points during worm aging , in two distinct strains (fer-15 and spe-9;fer-15) and at similar time points (6, 9 and 10 days for fer-15, 9 and 11 for spe-9;fer-15). In addition, the functional enrichment of its conserved set points at a potential role in embryonic development (p < 10-7). Another strongly conserved and novel motif, CTCCGCCC (rank 14), was independently found upstream of almost all transcribed worm microRNA genes in a recent study .
We found many interactions between the most conserved k-mers found at the previous stage. For example, the most conserved k-mer, TCTGCGTCT, is very often co-conserved with AGAGAGA. The high-scoring interaction between the DRE-like motif, AATCGAT and the putative E2F-binding site, TTTTCGC, also appears interesting. Indeed, the conserved sets for both k-mers are separately enriched significantly with genes involved in embryonic development, according to GO (p < 10-8 and p < 10-7, respectively). However, the conserved set of genes having both elements in their upstream regions is even more enriched in this GO category (p < 10-9). TTTTCGC also seems to interact with the novel site CGACACTCC, and the corresponding conserved set is enriched with genes involved in modification-dependent protein catabolism (p < 10-5). The full list of motif interactions is available at .
We applied FastCompare to the genomes of D. melanogaster and D. pseudoobscura, two species of Drosophila that diverged about 46 million years ago . The number of orthologous ORFs between these two species is 11,306 and here we only consider 2,000-bp upstream regions. Using 5,000 bp instead produced similar results, but also produced additional putative binding sites (results are available at ). It takes approximately 10 minutes for FastCompare to process the corresponding 45 Mbp of sequences and calculate a conservation score for all 7-mers, 8-mers and 9-mers on a typical desktop PC.
The distribution of conservation scores shown in Figure 7a, for actual and randomized data, shows once again that the high conservation scores obtained with the real sequences are very unlikely to be achieved by chance. Also, as shown in Figure 7a, many known regulatory elements fall on the tail of the distribution.
As for the yeast and worm genomes, we used functional annotations (GO), expression data and known TRANSFAC sites to evaluate the FastCompare predictions. Unfortunately, expression data is often available for only a subset of genes and its analysis led to very few validations. However, Figure 7b,c clearly shows that functional enrichment of the conserved sets and TRANSFAC matches strongly correlate with conservation score. As with yeasts and worms, we focused on the 400 highest-scoring 7-mers, which are particularly well supported by the functional enrichment analysis (see Figure 7b). The simple processing described in Materials and methods yielded 469 k-mers (k = 7, 8 or 9), which we further analyze below.
Known regulatory elements
As shown in Table 4a, we found at least 16 distinct known regulatory elements among the 469 highest-scoring k-mers. The most conserved element, AACAGCTG, is similar to the site known to be bound by AP-4 (mammals) and MyoD (worms, flies and mammals). One of the most interesting predictions is TATCGATA (rank 12); this palindromic motif, known as the DNA replication-related element (DRE), has been experimentally proved to be necessary for proper expression of several cell proliferation-related genes in D. melanogaster  and, more recently, the genes encoding the TATA-binding protein (TBP)  and catalase  in the same organism. Interestingly, it is both the motif with the closest median distance to ATG (DATG = 168), and the most over-represented k-mer (among the 469 highest scoring ones) within D. melanogaster upstream regions compared to exons, with a ratio of 5.39.
Several of the other predicted sites are known to be bound by Drosophila transcription factors involved in development. For example, FastCompare predicts TTTATGGC (rank 14) and TAATTGA (rank 24), the binding sites for two homeodomain transcription factors. The first site matches the TRANSFAC consensus binding site for Abd-B ([CG]NTTTATGGC), while the second site is the known consensus binding site for the Antennapedia (Antp) class of homeodomain proteins  (TAATTGA matches the TRANSFAC consensus binding site for Ubx, a member of the Antp class). FastCompare also predicts ATTTATGC, a site matching the TRANSFAC consensus binding site for the chicken CdxA protein ([AC]TTTAT[AG]), the homolog of the Caudal protein in D. melanogaster. Also, FastCompare predicts CAGGTGC, the binding site for the Snail repressor/activator protein, a transcription factor required for proper mesodermal development .
FastCompare also predicts ATTTGCATA (rank 3) as one of the most conserved putative regulatory elements between the two flies. This site is the binding site for the POU-domain family of transcription factors, and it is probably bound by one or several of the three POU-domain transcription factors in Drosophila: DFR, PDM-1 and PDM-2. These three proteins are involved in different stages of Drosophila development: DFR is expressed in midline glia and in tracheal cells , whereas the redundant PDM-1 and PDM-2 are essential for proper neuronal development .
Many of the known motifs found when comparing the two Drosophila genomes were also found when analyzing the worm genomes. For example, GAGA repeats are found to be strongly conserved, slightly oriented 3' to 5' (p < 10-4), and very significantly found upstream of genes involved in morphogenesis (p < 10-23). GTAAACA (rank 147), the DAF16-binding site in C. elegans, is also one of the most conserved sites between the two Drosophila genomes. This site is probably bound by dFOXO, the unique homolog of the C. elegans DAF16 protein in D. melanogaster .
As for both previous phylogenetic groups (yeasts and worms), the median distances to ATG for the conserved elements show that some of the predicted regulatory elements are severely constrained in terms of position. Among the most constrained k-mers are the DRE site (TATCGATA, DATG = 168) and the known AP-4/MyoD binding site (AACAGCTG, DATG = 373). However, both the optimal windows and the median distances in Table 4a show that, compared to previously studied organisms, a smaller number of conserved regulatory element are constrained. Using the distribution of median distances for all 7-mers, we find that the d0.025 = 798 and d0.975 = 1,126. Among the 469 highest scoring k-mers, 45 fall below 798 (p < 10-13) and 36 above 1,126 (p < 10-8), once again suggesting weaker positional constraints than in yeasts and worms, at least when considering the first 2,000 bp of 5' upstream sequences.
Novel predicted regulatory elements
FastCompare predicts many putative regulatory elements in Drosophila that to the best of our knowledge are unknown (Table 4b). One of these novel sites, CAGGTAG (rank 143), was found upstream of several genes that are activated before widespread activation of zygotic transcription (which begins during the 14th nuclear cycle), in several Drosophila species ; it was also found to be necessary for the early expression of several of these genes (Sxl and sisterlessB) in a subsequent study (J.R. ten Bosch, J.A. Benavides and T.W. Cline, personal communication). It is interesting to see that this particular site is significantly conserved upstream of genes involved in cell fate commitment (p < 10-8).
Some of these sites, such as the palindromic TTAATTA (rank 31), are found much more often in upstream regions than in exons (with an over-representation ratio of 3.07). Others, such as ACACACAC, are found to be significantly enriched upstream of genes in known functional categories (embryonic development, p < 10-9). The same site appears to be strongly oriented 5' to 3' (p < 10-12). Others, such as GTGTGACC or AAATGGCG, appear to be located closer to ATG than most other sites (DATG = 296 and 592, respectively).
We found many potential interactions between the most conserved sites discovered by FastCompare. For example, the POU-domain-binding site ATTTGCATA was found to be strongly co-conserved with TAATTGA, the Antp-binding site, and with many other potential homeodomain sites, such as AATAAAT and TAATTAA. The CACA repeats were also found to be co-conserved with several different sites, and in some cases, the set of genes having both sites simultaneously conserved in their upstream regions (conserved sets) was found to be enriched in certain functional categories, for example, ACACACAC and GAGAGAG, regulation of transcription (p < 10-12); ACACACAC and TAATTGC (an Antp variant site), embryonic development (p < 10-5). The full list of interactions is available at .
The much larger noncoding regions of mammalian genomes present significant challenges for computational motif discovery. Also, many repeat elements (for example, Alu) have colonized mammalian genomes and are likely to be conserved between closely related genomes. The distance between enhancers and the transcriptional start of the genes they regulate can be extremely large, reaching tens of kilobases. Finally, gene predictions and gene boundaries are still largely unverified experimentally for a large number of genes.
We applied FastCompare to the genomes of H. sapiens and M. musculus,, which diverged about 75 million years ago . The number of orthologous ORFs between these two species is 15,983 and again, we have only considered 2,000-bp upstream regions. As in flies, using 5,000-bp instead produced similar results. It takes approximately 15 minutes for FastCompare to process the corresponding 60 Mbp of sequences and calculate a conservation score for all 7-mers, 8-mers and 9-mers on a typical desktop PC.
Unlike the other genomes considered so far, the output of FastCompare from the mammalian genomes is dominated by GC-rich sequences, probably corresponding to CpG islands (GC-rich regions known to be associated with the promoters of many genes). However, analysis of the FastCompare output yielded the same validations as for other species. Indeed, the distribution of conservation scores obtained on actual and randomized sequences shows that high conservation scores are very unlikely to be obtained by chance (Figure 8a). As with other species, many known regulatory elements are on the tail of the distribution (Figure 8a). Also, as shown in Figure 8b-d, more k-mers are found upstream of over or underexpressed genes, more k-mers have their conserved set enriched with GO functional categories, and more k-mers match TRANSFAC consensus sites as the conservation score increases.
We found that masking Alu repeats did not influence the output of FastCompare (data not shown). To overcome the overabundance of GC-rich sequences in the FastCompare output, we use longer k-mers as starting points, namely 8-mers instead of 7-mers. We started with the 600 highest-scoring 8-mers, and replaced each of these 8-mers by one of its substrings (7-mer) or one of its superstrings (9-mer), when their conservation score is higher. We then removed duplicates in the list and added the high-scoring 9-mers that have no substrings within the list. This procedure yielded 284 k-mers (k = 7, 8, 9). Subsequent validation was limited to this small set of high-scoring predictions.
Known regulatory elements
As shown in Table 5a, we found 17 distinct known regulatory elements among the 284 highest-scoring k-mers. Among these are the well characterized sites for the Sp1, C/EBP, CREB and Myc/Max proteins or families of proteins. These four sites reside very close to ATG (their median distance to ATG is between 100 and 250 bp), suggesting that the four proteins (or families of proteins) may be involved in intimate interactions with the transcriptional complex. Sp1 is an ubiquitous transcription factor, involved in the basal expression of a large number of genes in mammals (see  for review). The CCAAT/enhancer binding protein (C/EBP) has been implicated in the regulation of cell-specific gene expression mainly in hepatocytes, adipocytes and hematopoietic cells (see  for review). Both Sp1 and C/EBP are constitutive transcription factors whose presence is necessary for significant induction of a large number of genes . The CRE-binding protein (CREB or CBP) is a transcription factor that binds cyclic AMP (cAMP) response elements (CREs) in the promoters of specific genes, and functions as a co-activator for a large number of other transcription factors (see  for review). The Myc/Max heterodimer binds the CACGTG sequence, and also acts as a transcriptional activator (see  for review).
Interestingly, we found that some of the most conserved interactions between k-mers (see Materials and methods) involve Sp1-binding sites (CCCGCCC or CCGCCCC) with other known sites such as CACGTGAC (Myc/Max), TGACGTCA (CREB), CGCAGGCGC (unknown), GCCAATC (CCAAT-box) and ACTTCCG (Ets), and that the median distances between these sites are relatively small (138, 164, 200.5, 234 and 234, respectively).
Among the other predicted regulatory elements returned by FastCompare are CCGCCTC, a site known as the insulin response element ; CGGAAGTGA, a site known to be bound by the GA-binding protein in human ; CGCATGCG, a site known as the palindromic octamer sequence, which was found at 132 bp (relative to ATG) upstream of the inosine-5'-monophosphate dehydrogenase type II gene in human, and shown to be functional in resting and activated T cells using site-directed mutagenesis, in vivo footprinting and electrophoretic mobility shift assay (EMSA) ; TTTCGCGC, the E2F-binding site; TAATCCCAG, a site known to be bound in D. melanogaster by the anterior morphogen Bicoid, and also recently shown to be bound in human by Goosecoid-like (GSCL) . Interestingly, this site has a relatively strong orientation bias 3' to 5' (p < 10-14). It is also the site with the strongest over-representation in upstream regions compared to exons that we observed, with a ratio of 7.06.
FastCompare also predicts ATTTGCAT, the binding site for the POU-domain Oct-1 and Oct-2 proteins, known to bind the promoter and intronic enhancer of immunoglobulin genes ; it also returns GGAAGTCCC, a site that was shown to bind NFκB [70, 71], a transcription factor involved in a variety of pathways (including inflammation, response to infection and oxidative stress, and apoptosis).
The distribution of distances to ATG for all 7-mers (Figure 9) shows an interesting bimodal shape, indicating that a large number of short sequences are constrained to reside around 500 bp to ATG. We calculated d0.025 = 342 and d0.975 = 1,185 and found that 83 k-mers among the 284 highest-scoring ones have a shorter median distance than 342 (p < 10-63) and only 11 have a larger median distance than 1,185. Indeed, a majority of the known sites identified by FastCompare are preferentially located near the 5' start of genes, with some elements being very close to ATG (for example, the CREB site, whose median distance to ATG is 107, whereas the optimal window is [0;1,000]). Nonetheless, a few known motifs do not seem to show any positional constraints. For example, the Bicoid-like site TAATCCCAG has a median distance to ATG of 1,258.
Novel predicted regulatory elements
FastCompare identifies many putative regulatory elements which to the best of our knowledge are novel (Table 5b). Some of these predicted regulatory elements are found upstream of over- or underexpressed genes in many microarray conditions. One example is CCCCAGC, which is significantly found upstream of overexpressed genes in 21 conditions (out of 30) of the human cell-cycle experiment . Other conserved elements are found much more often in upstream regions than in exons, for example, CCCCTCCC or TCTCGCGA, with ratios of 5.12 and 4.45, respectively. Others appear to be positionally constrained, for example, the palindromic CTGCGCA with an optimal window [0;300] and a median distance to ATG of 199, or constrained in terms of orientation, for example, GTGAGCCAC, which is significantly oriented 5' to 3' (p < 10-6).
To gain a better understanding of the network-level conservation of regulatory elements between the different phylogenetic groups, we compared the results we obtained by applying FastCompare to yeasts, worms, flies and mammals in the previous sections. We calculated the overlap (and its significance) of the 400 highest-scoring 7-mers and 8-mers found for each phylogenetic group. As shown in Table 6a,b, the number of shared predicted sites correlates with phylogenetic distance (the number of high-scoring putative motifs that two phylogenetic groups have in common decreases as the phylogenetic distance between the groups increases). All of the overlaps were found to be statistically significant, except for the yeast-human comparison. For both 7-mers and 8-mers, the best overlap is the one obtained between the two invertebrate phylogenetic groups: worms and flies. Indeed, simple observation of the identified known regulatory elements in Tables 2 and 4a reveals that these two organisms have a large number of predicted binding sites in common.
However, when we looked at the overlap between conserved sets for identical high-scoring k-mers in different phylogenetic groups (after determination of reciprocal best BLAST hits between the considered species), we found little overlap. The only significant overlap we found (after Bonferroni correction) was between the GATA sites (GATAAGA) in worm and fly (p = 2.5 × 10-4). As a control, we performed the same analysis within the yeast phylogenetic group, using the S. cerevisiae/S. bayanus and S. paradoxus/S. mikatae 400 most conserved 7-mers. One hundred and ninety-five sites were found in both groups of 7-mers, and for all of them, the overlaps between the conserved sets obtained separately in the S. cerevisiae/S. bayanus and S. paradoxus/S. mikatae analyses were highly significant, with hypergeometric p-values < 10-40. Therefore, our results strongly suggest that, while transcription factors have largely retained their ability to recognize specific DNA sites, their targets have largely changed through appearance or disappearance of those binding sites in promoters. This hypothesis is supported by recent analysis of the fission yeast cell cycle using microarrays, which showed that the role and the binding sites for several of the main transcription factors involved in regulating the yeast cell cycle (Swi4/Mbp1, Fkh1/Fkh2, Swi5/Ace2) are conserved between budding and fission yeasts (which diverged about 1 billion years ago), but the sets of genes that they regulate overlap much less than expected (only about 50 orthologous genes are cell-cycle-regulated in both species) .
It is particularly interesting to consider the seven 8-mers that are top predictions for all three multicellular phylogenetic groups (note that many more 7-mers are conserved between these groups). These sites include the CRE (TGACGTCA, GACGTCAC and ATGACGTC), the POU-domain binding site (ATTTGCAT), and the HRE (CAAGGTCA). A fourth site is also shared (GCCACGCC, CCACGCCC), which to the best of our knowledge is a novel motif. Its strong over-representation in upstream regions compared to coding regions, and its closeness to ATG (median DATG = 230 for GCCACGCC) make it a promising candidate for experimental testing. Interestingly, the location constraints on these conserved sites can vary across phylogenetic groups. For example, the CRE appears weakly constrained in worms and flies in terms of distance to ATG (DATG = 708 and 825, respectively), but is very close to ATG in mammalian genomes (DATG = 107). However, the distances to ATG of the POU-domain-binding sites (862, 882 and 729, respectively) indicate that their positional constraints are shared among the phylogenetic groups. The same holds for the HRE binding site (845, 1,015.5 and 895, respectively).
Discussion and conclusions
We have presented a powerful approach for discovering transcriptional regulatory elements that are globally conserved between pairs of genomes. Our approach requires only two unaligned genomes, thus allowing the use of genomes of arbitrary divergence and those with extensive rearrangements of noncoding regions. Moreover, our motif-finding strategy does not use any parameters other than a conservation score threshold, used to separate presumptive functional from nonfunctional motifs. We have shown that such thresholds can be roughly estimated using independent biological data, when available. Our approach is also computationally efficient: whole eukaryotic genomes can be processed in minutes on a typical computer. In turn, this efficiency allows FastCompare to explore exhaustive pattern lists.
Our results show that FastCompare can recover most of the known functional binding sites in S. cerevisiae when its upstream regions are compared to those of a related species, S. bayanus. We comprehensively explored the globally conserved motif content between worms, flies and mammalian genomes, discovering large sets of known and novel motifs. The use of external information (expression data, functional categories, TRANSFAC, chromatin IP and known motifs) clearly shows that our method is able to detect conserved and functional motifs in all the phylogenetic groups that we studied. In all analyses, we have shown that some of the discovered known or novel motifs were severely constrained, either in terms of position relative to the start of translation or in orientation. We also observed that some of the known or novel motifs are co-conserved within upstream regions, potentially revealing interactions between the (often unknown) transcription factors that bind them.
We have created a set of web tools to superimpose the most globally conserved k-mers discovered by FastCompare to user-supplied sequences or multiple alignments. An example is shown in Figure 10a, in which the upstream regions of the STE2 gene (encoding the alpha-factor pheromone receptor) from four different yeast species were aligned using ClustalW, and the most globally conserved k-mers are highlighted. All experimentally determined sites for STE2 were also predicted to be globally conserved by FastCompare. Moreover, several other sites also appear to be conserved, both at the global level (predicted by FastCompare) and the local level (shown by the multiple alignment). In Figure 10b, the same analysis was performed on only two orthologous upstream regions instead of four. Many more sites appear to be locally conserved than when using four species, but the globally conserved sites found by FastCompare allow the efficient selection of experimentally verified and putative binding sites. These tools should be particularly useful in designing stepwise promoter deletions and site-directed mutagenesis experiments for understanding the regulatory code of specific genes.
While powerful, our approach has potential limitations. Our current approach allows matches to a given k-mer to be on different strands within pairs of orthlogous upstream regions. This flexibility substantially increases the number of k-mers that are supported by independent biological data (that is, true positives), at least for yeasts and worms (data not shown). However, it is difficult to evaluate whether this flexibility introduces more true positives than false positives. Also, transcription factors often bind several slightly distinct sites with different affinities, and it is widely acknowledged that binding-site degeneracy is better captured by using position-weight matrices (PWM) instead of k-mers or consensus patterns . To evaluate whether weight matrices would display better conservation scores, we calculated a conservation score for weight matrices corresponding to 20 well characterized yeast binding sites, and compared them to the conservation scores obtained for the best k-mers that unambiguously correspond to the same binding sites. Conservation scores for weight matrices were calculated as described for k-mers in Materials and methods, except that we used the weight-matrix score thresholds that maximize the significance of the overlap between the two sets of ORFs containing matches to the weight matrices in each species. This involves progressively lowering the score threshold by small increments, and for each threshold, calculating the overlap and its hypergeometric p-value. We then choose the score threshold corresponding to the most significant p-value, and use the negative natural logarithm of this p-value as the conservation score. As shown in Table 7, only in 11 cases out of 20 did weight matrices have a higher conservation score than the corresponding k-mers. These results suggest that k-mers provide results that are almost as good as those obtained using weight matrices, when utilizing the network-level conservation criterion. One reason why, in many cases, k-mers have a higher conservation score than weight matrices may have to do with the more narrow selection of k-mers for binding sites with similar or identical affinities. In fact, we recently showed that PWM scores, widely seen as proxies for binding affinity, are statistically conserved in a comparison between S. cerevisiae and S. bayanus . In the context of the present study, the different k-mers representing each transcription factor binding site may be defining affinity classes that are more strongly conserved than a looser definition of a binding site represented by a weight matrix. Recent work in bacteria has established the importance of binding affinity, especially with respect to coordinating the temporal order of events .
However, Table 7 shows that the conservation score for weight matrices describing very degenerate binding sites, such as RAP1, is significantly higher than the conservation score obtained for the best corresponding k-mer. This suggest that our k-mer based approach is limited in its ability to discover highly degenerate binding sites.
As shown by our inter-group analysis, many regulatory elements have remained functional across evolution, but few have remained upstream of the same genes. The network-level conservation principle thus appears less applicable to species that diverged very long ago. For example, when we compared the Drosophila and mosquito genomes (which diverged approximately 400 million years ago), we only found a handful of k-mers (interestingly including GATA-factor and Myc/Max binding sites) to have conservation scores above those obtained from randomized data.
There are also several directions in which our approach could be extended. From a methodological standpoint, the approach could be extended to take into account local over-representation of identical or nearly identical copies of the same binding sites, a well known feature in the promoter regions of higher eukaryotic species . To discover highly degenerate regulatory elements, k-mers could be used to seed weight matrices whose individual weights could be optimized for network-level conservation, using stochastic optimization procedures (for example, simulated annealing; Mike Beer, personal communication). Introns and downstream noncoding regions could also be explored using our approach, as these regions are known to harbor functional regulatory elements in metazoan genomes. While our approach can deal with genomes presenting arbitrary levels of divergence and rearrangements, it would be interesting to investigate how global alignments or suboptimal and non-overlapping local alignments  could be used to filter out regions of non-conservation. This approach would be particularly interesting when analyzing very long upstream regions, in order to increase the signal-to-noise ratio. Finally, mRNA 3' UTRs could be compared in order to find specific downstream regulatory elements involved in post-transcriptional mRNA regulation (for example, mRNA localization, decay or translational repression).
Materials and methods
Outline of approach
First we determined orthology relationships between ORFs on the basis of reciprocal best BLAST hits (Figure 1a) and extracted the corresponding upstream regions from the genome sequences. Then, we considered every possible short DNA sequence of length k (k-mer, with k between 7 and 9) as a candidate regulatory element. For each k-mer, we found the set of ORFs whose upstream regions contain at least one exact match to the k-mer, anywhere in the upstream region, in the first genome. We did the same for the second genome, obtaining another set of ORFs. Then, we calculated the overlap between the two sets and assessed its statistical significance (Figure 1b). The statistical significance of the overlap provides a measure of conservation with which we score and rank every possible k-mer (Figure 1c). Note that our approach is very different from the classical k-tuple DNA sequence-analysis methods [77, 78], which are not based on comparative genomics and are local methods; that is, they only deal with single promoters or small sets of functionally related promoters (while our approach provides a genome-level measure of conservation for candidate regulatory elements).
Sequence sources and orthology determination
Sequence data were downloaded from the Saccharomyces Genome Database (SGD) for all yeast species considered in this paper; worm (C. elegans and C. briggsae), Drosophila (D. melanogaster), human (H. sapiens) and mouse (M. musculus) sequence data were downloaded from Ensembl . The D. pseudoobscura genome sequences (contigs) were downloaded from . The upstream regions used in this study are immediately adjacent to the ATG codons of their downstream genes, and are 1-kb long (yeasts) or 2-kb long (worms, flies and mammals). Note that transcription-factor-binding sites generally reside in the region situated upstream of the transcription start site. Unfortunately, not all genes have well annotated transcription start sites. This problem should not, however, strongly influence the output of FastCompare, as distances between start of transcription and start of translation should be at most on the order of a few hundred base-pairs (except in certain cases, for example when 5' UTRs are interrupted by long introns). However, as gene structures become better annotated (mainly as a result of massive cDNA sequencing projects) and promoter regions become more accurately delimited, we expect that the ability of FastCompare to discover regulatory elements will be significantly improved.
Orthology information provided by Ensembl or by Kellis et al.  was used throughout this study, when available. Ensembl provides strong homology relationships between genes from different species, but does not provide reciprocal best matches. Therefore, we determine reciprocal best matches using the provided sequence identity between homologous genes. When orthology information is not available in Ensembl (for example, between D. melanogaster and D. pseudoobscura, or between distant species such as S. cerevisiae and C. elegans), we determine orthologs using the reciprocal best BLAST hits approach.
Motif-finding algorithm and simple clustering
Given a value of k, we first generated the set of all possible k-mers and removed half of them on the basis of reverse complementarity. We also removed k-mers with very low complexity and which are over-abundant in the intergenic regions of the genomes we analyzed (that is, those that contain k - 1 or more As or Ts), as these sequences are unlikely to be regulatory elements. Every remaining k-mer (that is, 8,170 for k = 7) is then considered as a candidate regulatory element. For each k-mer, we found the set of ORFs in the first species that have at least one exact occurrence of the k-mer in their upstream regions. We then found the set of ORFs in the second species that have at least one occurrence of the same k-mer in their upstream region. Importantly, the matches can be anywhere in the upstream regions: they do not have to be at the same positions in two orthologous upstream regions (as with multiple alignment) and can be on any strand. Since both functional and non-functional elements are expected to be conserved between two closely related species, the two sets are expected to overlap. However, under the network-level conservation principle, the extent of the overlap - and therefore its statistical significance - will be even greater for k-mers that represent functional transcription factor binding sites. The significance of the overlap can be measured using the hypergeometric distribution. The probability of two sets of size s1 and s2, drawn from a set of N elements, to have i or more elements in common is given by :
In this way, all k-mers can be ranked by their hypergeometric p-values. It is important to note that due to basal conservation (that is, conservation arising from common ancestry), the hypergeometric p-values will generally be very small for most k-mers. Therefore, we only use these p-values as relative measures of network-level conservation and focus on k-mers with the greatest conservation. For simplicity, we define the 'conservation score' to be the negative logarithm (base e) of the hypergeometric p-value obtained for a given k-mer. Therefore, the more extensive the overlap between the two sets, the higher the conservation score. Also, for the same k-mer, we call 'conserved set' the set of ORFs corresponding to the overlap between the two sets of orthologous ORFs containing at least one exact match to the k-mer in their upstream regions. Conserved sets are used throughout this study to get insights into the function of the most conserved k-mers, using functional annotation [81, 82], chromatin IP , known motifs, and to evaluate whether these k-mers are constrained in terms of position or orientation.
The current FastCompare implementation handles k-mers with a user-specified gap (termed gapped k-mers), which is a straightforward extension of the approach described above. The conservation score returned by FastCompare is independent of the size of the patterns (that is, the value of k); therefore k-mers with different sizes, and gapped k-mers (for example, CGTNNNNNNTGA) can be compared.
We use the following strategy when applying FastCompare to pairs of genomes. First, we calculate conservation scores for all 7-mers, 8-mers and 9-mers. We then retain only the m highest-scoring 7-mers, with m chosen according to independent biological data (alternatively, m could be chosen according to the estimated number of transcription factors in the species being considered). We then replace each of the retained 7-mers by an 8-mer (if there is one) with higher conservation score for which the considered 7-mer is a substring. We also include within the final list the 8-mers which do not have any substrings within the m 7-mers. We then repeat the same process for the retained 8-mers, replacing each of them by its higher scoring 9-mer superstring if there is one, and add the 9-mers that do not have any substring within the 8-mers. This strategy thus allows the optimal length for candidate regulatory elements to be determined.
FastCompare is implemented in the C language and uses efficient data structures (hash tables and prefix trees ). For a given value of k, the worst-case time complexity is O(kn + 4k(p + k)), where n is the total amount of upstream sequences and p is the total number of orthologous pairs. Note that the first term is generally much larger than the second one; therefore the complexity of our approach can be seen as linear in the combined sizes of the genomes to be compared (when k is restricted to 7, 8 and 9). The calculation of hypergeometric p-values involves factorials of large integers, so we use specialized C routines, as described in . FastCompare runtimes provided in the Results section are obtained using a standard desktop PC (2.0 GHz CPU, 1 GB RAM).
Discovering positional constraints for conserved regulatory elements
As described in Results, we applied FastCompare to 1 kb (yeast) or 2 kb upstream regions (worms, flies and mammals). While these lengths are reasonable, they are somewhat arbitrary, and it is known that some regulatory elements are constrained to be within specific distances (often shorter than 1 kb) from the start of transcription, reflecting mechanistic constraints for transcription factor-transcription factor or transcription factor-RNA polymerase interactions . Moreover, some regulatory elements have orientation biases (see [11, 12] for examples). To discover such constraints, we analyzed the most conserved k-mers found at the previous stage in the following ways.
First, for each high-scoring k-mer, we calculated the median distance to ATG (as the start of transcription is generally not known) for the set of all (non-overlapping) occurrences of this k-mer within the upstream regions of its conserved set (see previous section for a definition of the conserved set of a given k-mer). To statistically assess whether the median distance to ATG for a given k-mer is unusually small or large, we built the distribution P(d) of median distances to ATG, for the entire set of 8,170 7-mers. We first created a histogram by binning the median distances to ATG for all 7-mers into 20-bp bins, and then smoothed the histogram (using a normal kernel and a bandwidth of 50 as implemented in the ksmooth function of the R statistical software package). Then, using numerical integration, we sought the distance thresholds d0.025 and d0.975 such that P(d <d0.025) = 0.025 and P(d <d0.975) = 0.975. We then considered the median distance to ATG for a given k-mer as unusually short or long when it is less than d0.025 or greater than d0.975, respectively.
For each k-mer, we also sought the sequence window which maximizes the conservation score by progressively shortening all upstream regions (all having equal lengths) by 100 bp increments from the 5' end. Then, we did the same from the 3' end using the optimal 5' end found in the previous step. Evaluating every possible window whose length is a multiple of 100 bp almost always yields identical results. We then calculated the conserved sets for these windows, and output the orientation (strand) for each k-mer occurrence within its conserved set (palindromes were counted on both strand).
Finally, using the results of the previous step, for each k-mer, we used the binomial distribution to assess whether the proportion of occurrences of this k-mer (within its conserved sets) on one strand is significantly smaller (or larger) than 0.5. Binomial p-values less than 0.05 (after Bonferroni correction) are considered significant.
It is now known that the regulatory code governing the expression of genes is combinatorial [11, 85, 86]. The network-level conservation principle can be trivially extended to discover interactions (that is, co-conservation) between two k-mers. To focus on heterotypic interactions, we only examined k-mers that differ by more than l nucleotides, after optimal ungapped alignment. We tested several values of l and found that l = 4 was most appropriate when using 7-, 8- and 9-mers. Then, we proceeded as described above, except that instead of seeking two sets of ORFs (one for each species) whose upstream regions contain a single k-mer, we sought the two sets of ORFs that contain the two k-mers simultaneously. Once these two sets were available, we evaluated the extent of their overlap as described above, and rank interaction pairs according to their conservation score.
We used randomized data to show that high conservation scores (obtained as described above) are unlikely to be obtained by chance, and independent biological information to assess the ability of FastCompare to predict functional regulatory elements by giving them a high conservation score. We also estimated the over-representation of predicted regulatory elements in upstream regions compared to coding regions.
Validation using randomized data
Our goal was to generate new pairs of upstream regions that are conserved at the same level of divergence as the actual sequence data. We align each pair of orthologous sequences using the Needleman-Wunsch algorithm , and calculate substitution frequencies between all pairs of nucleotides (A → A, A → T, and so on). Then, we reconstructed new pairs of orthologous sequences by mutating one of the sequences in each initial pair using the estimated frequencies. Generating the sequences to be mutated using locally estimated first-order Markov models yielded the same results.
Validation using independent biological information
The proportions of 7-mers supported by each type of independent data, as presented in Figures 3, 5, 7 and 8, is calculated as follows. In these figures, support for a given 7-mer is considered as binary, and depends on whether the 7-mer meets the particular validation criterion or not (or whether it is found upstream of over- or underexpressed genes, in at least one microarray condition, see below). 7-mers are first sorted by conservation score, and the proportion of supported 7-mers were calculated using a sliding window of 100 7-mers. For each window and each type of independent biological data, we simply calculated the number of 7-mers for which support is available and divided this number by 100.
Yeast (S. cerevisiae), worm (C. elegans), fly (D. melanogaster) and human (H. sapiens) functional categories and corresponding ORF annotations were downloaded from the MIPS  and GO  websites. The statistical significance of the functional enrichments within sets of ORFs was evaluated using the hypergeometric distribution, as discussed above. Hypergeometric p-values for functional enrichment were not corrected for multiple testing, but only p-values smaller than 10-4 are reported, providing a slightly less stringent thresholds than Bonferroni corrections.
Weight matrices corresponding to known yeast motifs were obtained from Gibbs sampling-based motif finding on chromatin IP data , functional categories and clusters of co-expressed genes . Only high-confidence binding sites (that is, sites confirmed by several sources including the literature) were included in our list of known motifs. We label a given k-mer as a known motif if it meets the following two criteria. The first is significant overlap (p < 10-4) between the conserved set for the given k-mer and the set of ORFs whose upstream regions contain at least one match to the known motif (the sets of ORFs were defined using ScanACE with the weight matrix for the known motif, and with the standard average minus two standard deviations threshold ). The second criterion is strong sequence similarity between the considered k-mer and the known motif weight matrix. To evaluate this similarity, we turn the considered k-mer into a weight matrix of 0s and 1s, and use CompareACE  to calculate the Pearson correlation between the weights of this matrix and the weights of the known motif weight matrix; correlation coefficients > 0.65 are considered significant. Finally, for a given k-mer, we report the known motif for which the above hypergeometric p-value is the smallest.
In vivobinding data (chromatin IP)
Genome-wide binding locations were previously evaluated for 106 transcription factors in S. cerevisiae . For each transcription factor, we retain the set of ORFs with p-value < 0.001 (see  for details of the error model). To evaluate a given k-mer with respect to chromatin IP, we evaluate the statistical significance of the overlap between the conserved set of the considered k-mer and the set of ORFs defined as described above corresponding to each transcription factor. We report the most significant chromatin IP enrichment, with hypergeometric p-value < 10-4.
The 309 weight matrices and corresponding consensus patterns for known transcription factor binding sites were downloaded from [90, 91]. k-mers were then simply matched to the consensus patterns. We eliminated consensus patterns that match too many k-mers, by matching each of them to all (8,170) 7-mers and removing consensus patterns that matched more than 50 7-mers.
Microarray expression data
Expression data for all species considered were downloaded from diverse sources [92, 93]. Overall, we downloaded 765 microarray conditions for S. cerevisiae, 555 conditions for C. elegans, 156 conditions for D. melanogaster, and 1,384 conditions for H. sapiens. We use these expression data in the following way.
We evaluated the over-representation of each k-mer in the upstream regions of genes that are themselves over- or underexpressed in certain microarray conditions. Over- or underexpressed genes are operationally defined as having a log ratio of intensity above average plus two standard deviations, or below average minus two standard deviations, respectively (averages and standard deviations are calculated for each condition; using fold changes instead of standard deviations produced roughly the same results). To evaluate the over-representation of a given k-mer in a given microarray condition, we defined as O1 the set of overexpressed genes in this condition, and as O2 the set of ORFs whose upstream regions contain at least one occurrence of the considered k-mer, genome-wide. Then, we evaluated the significance of the overlap between O1 and O2 using the hypergeometric distribution, as described above. Overlaps whose hypergeometric p-value is smaller than 0.05 (after Bonferroni correction) were considered significant. We proceeded separately with the set of underexpressed genes in the same way. The total number of microarray conditions (overexpressed plus underexpressed) for which a k-mer was found to be significantly over-represented is reported. Note that we do not use the conserved set for the considered k-mer here, as we do not want to restrict our analysis to orthologous genes. Indeed, except for yeast, microarrays often contain only a fraction of all genes within the considered organism. In these cases, the overlap between conserved sets and over- or underexpressed genes can be very small, reducing statistical power. Using all genes, therefore, increases our power to detect significant associations, while retaining a uniform approach for all species considered.
Over-representation in upstream regions compared to coding regions
As shown in  for the yeast RAP1 transcription factor, some transcription factors bind intergenic regions much more frequently than they bind coding regions. Consequently, it is expected that sequences corresponding to regulatory elements are more often present in intergenic regions than in coding regions. To evaluate this bias, we calculate the ratio of the number of genes that have the k-mer in their upstream regions over the number of genes that have the k-mer in their coding regions (using only exons), and we correct this ratio using the average length of the upstream and coding regions.
The FastCompare implementation, all the sequences, and results are available on our website .
Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, et al: Transcriptional regulatory networks in Saccharomyces cerevisiae. Science. 2002, 298: 799-804. 10.1126/science.1075090.
Stormo GD: DNA binding sites: representation and discovery. Bioinformatics. 2000, 16: 16-23. 10.1093/bioinformatics/16.1.16.
Cliften P, Sudarsanam P, Desikan A, Fulton L, Fulton B, Majors J, Waterston R, Cohen BA, Johnston M: Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science. 2003, 301: 71-76. 10.1126/science.1084337.
Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES: Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature. 2003, 423: 241-254. 10.1038/nature01644.
Aparicio S, Morrison A, Gould A, Gilthorpe J, Chaudhuri C, Rigby P, Krumlauf R, Brenner S: Detecting conserved regulatory elements with the model genome of the Japanese puffer fish, Fugu rubripes. Proc Natl Acad Sci USA. 1995, 92: 1684-1688.
Pritsker M, Liu YC, Beer MA, Tavazoie S: Whole-genome discovery of transcription factor binding sites using network-level conservation. Genome Res. 2004, 14: 99-108. 10.1101/gr.1739204.
Hughes JD, Estep PW, Tavazoie S, Church GM: Computational identification of cis -regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol. 2000, 296: 1205-1214. 10.1006/jmbi.2000.3519.
Zhu J, Zhang MQ: SCPD: a promoter database of the yeast Saccharomyces cerevisiae. Bioinformatics. 1999, 15: 607-611. 10.1093/bioinformatics/15.7.607.
Yamaguchi-Iwai Y, Dancis A, Klausner RD: AFT1: a mediator of iron regulated transcriptional control in Saccharomyces cerevisiae. EMBO J. 1995, 14: 1231-1239.
Beer MA, Tavazoie S: Predicting gene expression from sequence. Cell. 2004, 117: 185-198. 10.1016/S0092-8674(04)00304-6.
Erives A, Levine M: Coordinate enhancers share common organizational features in the Drosophila genome. Proc Natl Acad Sci USA. 2004, 101: 3851-3856. 10.1073/pnas.0400611101.
Sudarsanam P, Pilpel Y, Church GM: Genome-wide co-occurrence of promoter elements reveals a cis-regulatory cassette of rRNA transcription motifs in Saccharomyces cerevisiae. Genome Res. 2002, 12: 1723-1731. 10.1101/gr.301202.
Blaiseau PL, Thomas D: Multiple transcriptional activation complexes tether the yeast activator Met4 to DNA. EMBO J. 1998, 17: 6327-6336. 10.1093/emboj/17.21.6327.
Chiang DY, Moses AM, Kellis M, Lander ES, Eisen MB: Phylogenetically and spatially conserved word pairs associated with gene-expression changes in yeasts. Genome Biol. 2003, 4: R43-10.1186/gb-2003-4-7-r43.
Davidson EH: Genomic Regulatory Systems. 2001, San Diego, CA: Academic Press
Coghlan A, Wolfe KH: Fourfold faster rate of genome rearrangement in nematodes than in Drosophila. Genome Res. 2002, 12: 857-867. 10.1101/gr.172702.
Maduro MF, Rothman JH: Making worm guts: the gene regulatory network of the Caenorhabditis elegans endoderm. Dev Biol. 2002, 246: 68-85. 10.1006/dbio.2002.0655.
Cui M, Han M: Cis regulatory requirements for vulval cell-specific expression of the Caenorhabditis elegans fibroblast growth factor gene egl-17. Dev Biol. 2003, 257: 104-116. 10.1016/S0012-1606(03)00033-2.
Gaudet J, Mango SE: Regulation of organogenesis by the Caenorhabditis elegans FoxA protein PHA-4. Science. 2002, 295: 821-825. 10.1126/science.1065175.
Maduro MF, Meneghini MD, Bowerman B, Broitman-Maduro G, Rothman JH: Restriction of mesendoderm to a single blastomere by the combined action of SKN-1 and a GSK-3 homolog is mediated by MED-1 and -2 in C. elegans. Mol Cell. 2001, 7: 475-485. 10.1016/S1097-2765(01)00195-2.
Harfe BD, Fire A: Muscle and nerve-specific regulation of a novel NK-2 class homeodomain factor in Caenorhabditis elegans. Development. 1998, 125: 421-429.
Jantsch-Plunger V, Fire A: Combinatorial structure of a body muscle-specific transcriptional enhancer in Caenorhabditis elegans. J Biol Chem. 1994, 269: 27021-27028.
Tsukiyama T, Becker PB, Wu C: ATP-dependent nucleosome disruption at a heat-shock promoter mediated by binding of GAGA transcription factor. Nature. 1994, 367: 525-532. 10.1038/367525a0.
King-Jones K, Korge G, Lehmann M: The helix-loop-helix proteins dAP-4 and daughterless bind both in vitro and in vivo to SEBP3 sites required for transcriptional activation of the Drosophila gene Sgs-4. J Mol Biol. 1999, 291: 71-82. 10.1006/jmbi.1999.2963.
Krause M, Fire A, Harrison SW, Priess J, Weintraub H: CeMyoD accumulation defines the body wall muscle cell fate during C. elegans embryogenesis. Cell. 1990, 63: 907-919. 10.1016/0092-8674(90)90494-Y.
Hu YF, Luscher B, Admon A, Mermod N, Tjian R: Transcription factor AP-4 contains multiple dimerization domains that regulate dimer specificity. Genes Dev. 1990, 4: 1741-1752.
Blackwell TK, Weintraub H: Differences and similarities in DNA-binding preferences of MyoD and E2A protein complexes revealed by binding site selection. Science. 1990, 250: 1104-1110.
Krause M, Park M, Zhang J, Yuan J, Harfe B, Xu S, Greenwald I, Cole M, Paterson B, Fire A: A C. elegans E/Daughterless bHLH protein marks neuronal but not striated muscle development. Development. 1997, 124: 2179-2189.
Furuyama T, Nakazawa T, Nakano I, Mori N: Identification of the differential distribution patterns of mRNAs and consensus binding sequences for mouse DAF-16 homologues. Biochem J. 2000, 349: 629-634. 10.1042/0264-6021:3490629.
Murphy CT, McCarroll SA, Bargmann CI, Fraser A, Kamath RS, Ahringer J, Li H, Kenyon C: Genes that act downstream of DAF-16 to influence the lifespan of Caenorhabditis elegans. Nature. 2003, 424: 277-283. 10.1038/nature01789.
Lee SS, Kennedy S, Tolonen AC, Ruvkun G: DAF-16 target genes that control C. elegans life-span and metabolism. Science. 2003, 300: 644-647. 10.1126/science.1083614.
Gronostajski RM: Analysis of nuclear factor I binding to DNA using degenerate oligonucleotides. Nucleic Acids Res. 1986, 14: 9117-9132.
Lee W, Mitchell P, Tjian R: Purified transcription factor AP-1 interacts with TPA-inducible enhancer elements. Cell. 1987, 49: 741-752. 10.1016/0092-8674(87)90612-X.
Kockel L, Homsy J, Bohmann D: Drosophila AP-1: lessons from an invertebrate. Oncogene. 2001, 20: 2347-2364. 10.1038/sj.onc.1204300.
Karin M, Liu Z, Zandi E: AP-1 function and regulation. Curr Opin Cell Biol. 1997, 9: 240-246. 10.1016/S0955-0674(97)80068-3.
Grandori C, Cowley SM, James LP, Eisenman RN: The Myc/Max/Mad network and the transcriptional control of cell behavior. Annu Rev Cell Dev Biol. 2000, 16: 653-699. 10.1146/annurev.cellbio.16.1.653.
Rice DA, Mouw AR, Bogerd AM, Parker KL: A shared promoter element regulates the expression of three steroidogenic enzymes. Mol Endocrinol. 1991, 5: 1552-1561.
Ueda H, Sun GC, Murata T, Hirose S: A novel DNA-binding motif abuts the zinc finger domain of insect nuclear hormone receptor FTZ-F1 and mouse embryonal long terminal repeat-binding protein. Mol Cell Biol. 1992, 12: 5667-5672.
Shaywitz AJ, Greenberg ME: CREB: a stimulus-induced transcription factor activated by a diverse array of extracellular signals. Annu Rev Biochem. 1999, 68: 821-861. 10.1146/annurev.biochem.68.1.821.
Dijk MAV, Voorhoeve PM, Murre C: Pbx1 is converted into a transcriptional activator upon acquiring the N-terminal region of E2A in pre-B-cell acute lymphoblastoid leukemia. Proc Natl Acad Sci U S A. 1993, 90: 6061-6065.
Manak JR, Mathies LD, Scott MP: Regulation of a decapentaplegic midgut enhancer by homeotic proteins. Development. 1994, 120: 3605-3619.
Mauhin V, Lutz Y, Dennefeld C, Alberga A: Definition of the DNA-binding site repertoire for the Drosophila transcription factor SNAIL. Nucleic Acids Res. 1993, 21: 3951-3957.
Huber HE, Edwards G, Goodhart PJ, Patrick DR, Huang PS, Ivey-Hoyle M, Barnett SF, Oliff A, Heimbrook DC: Transcription factor E2F binds DNA as a heterodimer. Proc Natl Acad Sci U S A. 1993, 90: 3525-3529.
Boxem M, vanden Heuvel S: C. elegans class B synthetic multivulva genes act in G(1) regulation. Curr Biol. 2002, 12: 906-911. 10.1016/S0960-9822(02)00844-8.
Ceol CJ, Horvitz HR: dpl-1 DP and efl-1 E2F act with lin-35 Rb to antagonize Ras signaling in C. elegans vulval development. Mol Cell. 2001, 7: 461-473. 10.1016/S1097-2765(01)00194-0.
Kwon JY, Hong M, Choi MS, Kang S, Duke K, Kim S, Lee S, Lee J: Ethanol-response genes and their regulation analyzed by a microarray and comparative genomic approach in the nematode Caenorhabditis elegans. Genomics. 2004, 83: 600-614. 10.1016/j.ygeno.2003.10.008.
Lund J, Tedesco P, Duke K, Wang J, Kim SK, Johnson TE: Transcriptional profile of aging in C. elegans. Curr Biol. 2002, 12: 1566-1573. 10.1016/S0960-9822(02)01146-6.
Ohler U, Yekta S, Lim LP, Bartel DP, Burge CB: Patterns of flanking sequence conservation and a characteristic upstream motif for microRNA gene identification. RNA. 2004, 10: 1309-1322. 10.1261/rna.5206304.
Celniker SE, Rubin GM: The Drosophila melanogaster genome. Annu Rev Genomics Hum Genet. 2003, 4: 89-117. 10.1146/annurev.genom.4.070802.110323.
Matsukage A, Hirose F, Hayashi Y, Hamada K, Yamaguchi M: The DRE sequence TATCGATA, a putative promoter-activating element for Drosophila melanogaster cell-proliferation-related genes. Gene. 1995, 166: 233-236. 10.1016/0378-1119(95)00586-2.
Choi T, Cho N, Oh Y, Yoo M, Matsukage A, Ryu Y, Han K, Yoon J, Baek K: The DNA replication-related element (DRE)-binding factor (DREF) system may be involved in the expression of the Drosophila melanogaster TBP gene. FEBS Lett. 2000, 483: 71-77. 10.1016/S0014-5793(00)02085-8.
Park SY, Kim YS, Yang DJ, Yoo MA: Transcriptional regulation of the Drosophila catalase gene by the DRE/DREF system. Nucleic Acids Res. 2004, 32: 1318-1324. 10.1093/nar/gkh302.
Hanes SD, Brent R: A genetic model for interaction of the homeodomain recognition helix with DNA. Science. 1991, 251: 426-430.
Anderson MG, Perkins GL, Chittick P, Shrigley RJ, Johnson WA: Drifter, a Drosophila POU-domain transcription factor, is required for correct differentiation and migration of tracheal cells and midline glia. Genes Dev. 1995, 9: 123-137.
Bhat KM, Poole SJ, Schedl P: The miti-mere and pdm1 genes collaborate during specification of the RP2/sib lineage in Drosophila neurogenesis. Mol Cell Biol. 1995, 15: 4052-4063.
Junger MA, Rintelen F, Stocker H, Wasserman JD, Vegh M, Radimerski T, Greenberg ME, Hafen E: The Drosophila Forkhead transcription factor FOXO mediates the reduction in cell number associated with reduced insulin signaling. J Biol. 2003, 2: 20-10.1186/1475-4924-2-20.
Erickson JW, Cline TW: Key aspects of the primary sex determination mechanism are conserved across the genus Drosophila. Development. 1998, 125: 3259-3268.
Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, et al: Initial sequencing and comparative analysis of the mouse genome. Nature. 2002, 420: 520-562. 10.1038/nature01262.
Suske G: The Sp-family of transcription factors. Gene. 1999, 238: 291-300. 10.1016/S0378-1119(99)00357-1.
Ramji DP, Foka P: CCAAT/enhancer-binding proteins: structure, function and regulation. Biochem J. 2002, 365: 561-575.
Latchman D: Eukaryotic Transcription Factors. 1997, London: Academic Press
Vo N, Goodman RH: CREB-binding protein and p300 in transcriptional regulation. J Biol Chem. 2001, 276: 13505-13508.
Bernards R: Transcriptional regulation. Flipping the Myc switch. Curr Biol. 1995, 5: 859-861. 10.1016/S0960-9822(95)00173-4.
Nasrin N, Ercolani L, Denaro M, Kong XF, Kang I, Alexander M: An insulin response element in the glyceraldehyde-3-phosphate dehydrogenase gene binds a nuclear protein induced by insulin in cultured cells and by nutritional manipulations in vivo. Proc Natl Acad Sci U S A. 1990, 87: 5273-5277.
Suzuki F, Goto M, Sawa C, Ito S, Watanabe H, Sawada J, Handa H: Functional interactions of transcription factor human GA-binding protein subunits. J Biol Chem. 1998, 273: 29302-29308. 10.1074/jbc.273.45.29302.
Zimmermann AG, Wright KL, Ting JP, Mitchell BS: Regulation of inosine-5'-monophosphate dehydrogenase type II gene expression in human T cells. Role for a novel 5' palindromic octamer sequence. J Biol Chem. 1997, 272: 22913-22923. 10.1074/jbc.272.36.22913.
Gottlieb S, Hanes SD, Golden JA, Oakey RJ, Budarf ML: Goosecoid-like, a gene deleted in DiGeorge and velocardiofacial syndromes, recognizes DNA with a bicoid-like specificity and is expressed in the developing mouse brain. Hum Mol Genet. 1998, 7: 1497-1505. 10.1093/hmg/7.9.1497.
Singh H, Sen R, Baltimore D, Sharp PA: A nuclear factor that binds to a conserved sequence motif in transcriptional control elements of immunoglobulin genes. Nature. 1986, 319: 154-158. 10.1038/319154a0.
Nie Z, Mei Y, Ford M, Rybak L, Marcuzzi A, Ren H, Stiles GL, Ramkumar V: Oxidative stress increases A1 adenosine receptor expression by activating nuclear factor kappa B. Mol Pharmacol. 1998, 53: 663-669.
Glasgow JN, Wood T, Perez-Polo JR: Identification and characterization of nuclear factor κB binding sites in the murine bcl-x promoter. J Neurochem. 2000, 75: 1377-1389. 10.1046/j.1471-4159.2000.0751377.x.
Whitfield ML, Sherlock G, Saldanha AJ, Murray JI, Ball CA, Alexander KE, Matese JC, Perou CM, Hurt MM, Brown PO, Botstein D: Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Mol Biol Cell. 2002, 13: 1977-2000. 10.1091/mbc.02-02-0030..
Rustici G, Mata J, Kivinen K, Lio P, Penkett CJ, Burns G, Hayles J, Brazma A, Nurse P, Bahler J: Periodic gene expression program of the fission yeast cell cycle. Nat Genet. 2004, 36: 809-817. 10.1038/ng1377.
Stormo GD, Fields DS: Specificity, free energy and information content in protein-DNA interactions. Trends Biochem Sci. 1998, 23: 109-113. 10.1016/S0968-0004(98)01187-6.
Kalir S, Alon U: Using a quantitative blueprint to reprogram the dynamics of the flagella gene network. Cell. 2004, 117: 713-720. 10.1016/j.cell.2004.05.010.
Waterman MS, Eggert M: A new algorithm for best subsequence alignments with application to tRNA-rRNA comparisons. J Mol Biol. 1987, 197: 723-728. 10.1016/0022-2836(87)90478-5.
Wolfertstetter F, Frech K, Herrmann G, Werner T: Identification of functional elements in unaligned nucleic acid sequences by a novel tuple search algorithm. Comput Appl Biosci. 1996, 12: 71-80.
Zhang MQ: Identification of human gene core promoters in silico. Genome Res. 1998, 8: 319-326.
Curwen V, Eyras E, Andrews TD, Clarke L, Mongin E, Searle SM, Clamp M: The ENSEMBL automatic gene annotation system. Genome Res. 2004, 14: 942-950. 10.1101/gr.1858004.
Human Genome Sequencing Center at Baylor College of Medicine: Drosophila genome project. [http://www.hgsc.bcm.tmc.edu/projects/drosophila]
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-29. 10.1038/75556.
Mewes HW, Amid C, Arnold R, Frishman D, Guldener U, Mannhaupt G, Munsterkotter M, Pagel P, Strack N, Stumpflen V, et al: MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res. 2004, D41-D44. 10.1093/nar/gkh092. 32 Database
Gusfield D: Algorithms on Strings, Trees, and Sequences. 1997, Cambridge, UK: Cambridge University Press
Press WH, Flannery BP, Teukolsky SA, Vetterling WT: Numerical Recipes in C: The Art of Scientific Computing. 1993, Cambridge, UK: Cambridge University Press
Pilpel Y, Sudarsanam P, Church GM: Identifying regulatory networks by combinatorial analysis of promoter elements. Nat Genet. 2001, 29: 153-159. 10.1038/ng724.
Yuh CH, Bolouri H, Davidson EH: Genomic cis -regulatory logic: experimental and computational analysis of a sea urchin gene. Science. 1998, 279: 1896-1902. 10.1126/science.279.5358.1896.
Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970, 48: 443-453.
Comprehensive yeast genome database. [http://mips.gsf.de/proj/yeast/catalogues/funcat]
Gene Ontology. [http://www.geneontology.org]
Matys V, Fricke E, Geffers R, Gössling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel AE, Kel-Margoulis OV, et al: TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res. 2003, 31: 374-378. 10.1093/nar/gkg108.
Gollub J, Ball CA, Binkley G, Demeter J, Finkelstein DB, Hebert JM, Hernandez-Boussard T, Jin H, Kaloper M, Matese JC, et al: The Stanford Microarray Database: data access and quality assessment tools. Nucleic Acids Res. 2003, 31: 94-96. 10.1093/nar/gkg078.
Stuart JM, Segal E, Koller D, Kim SK: A gene-coexpression network for global discovery of conserved genetic modules. Science. 2003, 302: 249-255. 10.1126/science.1087447.
Lieb JD, Liu X, Botstein D, Brown PO: Promoter-specific binding of Rap1 revealed by genome-wide maps of protein-DNA association. Nat Genet. 2001, 28: 327-334. 10.1038/ng569.
Balasubramanian B, Lowry CV, Zitomer RS: The Rox1 repressor of the Saccharomyces cerevisiae hypoxic genes is a specific DNA-binding protein with a high-mobility-group motif. Mol Cell Biol. 1993, 13: 6071-6078.
Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, Storz G, Botstein D, Brown PO: Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell. 2000, 11: 4241-4257.
We thank David Stern, Mike Beer, Chang Chan, Yir-Chung Liu and two anonymous reviewers for providing helpful comments on the manuscript, Mike Beer for providing weight matrices for known transcription factors in yeast and the other members of the Tavazoie laboratory for helpful discussions. S.T. is supported in part by grants from NSF CAREER, DARPA, and NIH.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.