Phylogenetic simulation of promoter evolution: estimation and modeling of binding site turnover events and assessment of their impact on alignment tools
© Huang et al; licensee BioMed Central Ltd. 2007
Received: 11 April 2007
Accepted: 24 October 2007
Published: 24 October 2007
The phenomenon of functional site turnover has important implications for the study of regulatory region evolution, such as for promoter sequence alignments and transcription factor binding site (TFBS) identification. At present, it remains difficult to estimate TFBS turnover rates on real genomic sequences, as reliable mappings of functional sites across related species are often not available. As an alternative, we introduce a flexible new simulation system, Phylogenetic Simulation of Promoter Evolution (PSPE), designed to study functional site turnovers in regulatory sequences.
Using PSPE, we study replacement turnover rates of different individual TFBSs and simple modules of two sites under neutral evolutionary functional constraints. We find that TFBS replacement turnover can happen rapidly in promoters, and turnover rates vary significantly among different TFBSs and modules. We assess the influence of different constraints such as insertion/deletion rate and translocation distances. Complementing the simulations, we give simple but effective mathematical models for TFBS turnover rate prediction. As one important application of PSPE, we also present a first systematic evaluation of multiple sequence aligners regarding their capability of detecting TFBSs in promoters with site turnovers.
PSPE allows researchers for the first time to investigate TFBS replacement turnovers in promoters systematically. The assessment of alignment tools points out the limitations of current approaches to identify TFBSs in non-coding sequences, where turnover events of functional sites may happen frequently, and where we are interested in assessing the similarity on the functional level. PSPE is freely available at the authors' website.
Transcription regulation is a central component in the control of gene expression. Identification of functional cis-elements in promoter regions, a key to understanding gene regulation, has turned out to be a difficult task thus far. With the increasing availability of genome sequences, phylogenetic footprinting appeared to offer a very promising approach for identifying cis-elements [1, 2]. One essential assumption of phylogenetic footprinting is sequence conservation of functionally homologous genes. While such an assumption has been frequently found to be true for protein encoding sequences, there is no straightforward relationship of conservation between sequence and function for non-protein-coding regulatory sequences [3, 4].
Compared to protein-coding regions, transcriptional promoter regions are subject to much less stringent selection and have higher nucleotide substitution rates, where short transcription factor binding sites can easily turn over and be replaced by new ones arising from random mutations [5, 6]. In many cases, the function of a regulatory sequence may, however, remain well conserved despite substantial sequence changes. One of the best-studied examples is the even-skipped enhancer system S2E of Drosophila species, which is highly conserved at the functional level (for example, maintaining a high similarity of expression pattern) but substantially diverged at the sequence level. Such sequence divergence includes large insertions and deletions between different sites, substitutions within sites, and gains and losses of sites. Several experimental studies suggested that compensatory mutations in the even-skipped enhancer region are the key to maintain the functionality of the enhancer in evolution [7–9]. Estimates of transcription factor binding site (TFBS) turnover rates rank as high as 32-40% between human and rodent species , and can also happen at transcription start sites (TSSs) of orthologous genes , albeit at a lower frequency. The phenomenon of TFBS turnovers in regulatory regions suggest that any phylogenetic footprinting methods based on a simple trace of the evolution of nucleotides can be highly effective in some cases, but are unlikely to be able to identify all functionally important elements in regulatory genomic sequences, particularly in distantly related species. In this sense, a major improvement in TFBS identification will rely on a better understanding of evolutionary mechanisms regarding TFBS turnover events.
While TFBS turnover has been known for a long time, it has not become a widely studied topic until recently, when the availability of related genome sequences made it amenable to systematic studies [11–13]. With our currently limited knowledge about their structure and functional constraints, it is much more challenging to study the evolution of regulatory sequences than of protein-coding sequences. Most published experimental studies have been conducted on a gene-by-gene and element-by-element basis, and computational studies on real data are severely limited by the available functional site mapping data. In the absence of real biological data, computational simulation may provide the best way to study TFBS evolution and turnover in a systematic way. A pioneering simulation of TFBS evolution estimated the expected time for new binding sites to arise from point mutations in promoter regions, where binding sites were represented by simple consensus sequences, and promoters were evolved under a neutral evolution model . A recent study examined the expected time for a new site to evolve and become fixed in a population by positive selection, where the authors considered effective population size and used position weight matrices (PWMs) to model TFBSs . The study found that the existence and location of pre-sites of functional sites could be major factors determining the expected time and location of newly evolved sites, while the relative position of sites had little impact on the final location of new functional sites.
The above simulation studies explicitly assume that the functions encoded in regulatory regions evolve and change with the change in sequences. There are, however, many cases like the evolution of the even-skipped enhancer mentioned above, in which the regulatory sequence changes but functions (that is, the resulting expression patterns) appear unchanged. Frequently, such genes are involved in crucial developmental processes and, therefore, subject to stringent functional constraints [15–18]. Our study thus investigates how a promoter evolves under the neutral scenario of functional maintenance in 'status quo', that is, with little or no change in the presence and strength of functional elements. Specifically, we address the expected replacement turnover rate (RTR) of TFBSs in promoter sequences in relation to evolutionary distance, insertion/deletion (InDel) rate, and restricted translocation distance of TFBSs. In accordance with previous work, our study suggests that replacement turnover of TFBSs can happen quickly in evolution and varies significantly among different TFBSs, but can be predicted using simple mathematical models.
TFBS turnover phenomena in promoter sequences raise the important question about the ability of current multiple sequence alignment (MSA) tools to identify TFBSs in comparative genomics studies. Comparative evaluations of alignment tools have been conducted previously, but usually in conjunction with a newly developed tool [19–22] and with only few attempts at a comprehensive or systematic evaluation of different tools [23–26]. However, little has been done regarding a performance evaluation of MSA tools for the task of aligning non-coding genomic sequences, largely due to lack of good benchmark datasets of real sequences. As a result, tool performance assessment on genomic sequences was often based on indirect measures, such as an alignment of putative conserved non-coding regions, functional sites , or exon regions .
Uncovering TFBSs in promoter sequences by cross-species comparison has so far been successful in some cases, but most approaches rely on alignments that are pre-computed on the whole genome. It is an open issue how appropriate these strategies are for non-coding alignments. Taking advantage of our Phylogenetic Simulation of Promoter Evolution (PSPE) simulation tool, we assess the performance of commonly used MSA algorithms for aligning TFBS in orthologous promoter sequences, where the function of a promoter (that is, an ensemble of binding sites under constraints) is maintained, but TFBS replacement turnovers are allowed to occur. Different from previous studies that assessed tool performance with respect to their ability to align homologous bases, we thus focus on assessing tool performance by their ability to align functional sites that are homologous at the functional level but may not be homologous at the sequence level. To our knowledge, no such assessment of MSA tool performance from the viewpoint of functional homology, that is, alignment of functional elements in the presence of re-arrangements and turnovers, has been carried out. Our findings can thus serve as useful references for alignment tool selection in comparative genomics and provide insights for the improvement of non-coding multiple sequence alignment.
We designed a new computational system, PSPE, specifically to perform simulations of regulatory sequence evolution, such as promoter sequences. Different from other programs for sequence evolution simulation, which frequently use different evolutionary models for functional and non-functional sites, PSPE imposes a variety of functional constraints and validates at discrete intervals that these constraints are maintained. Such functional constraints include GC content, presence and strength of functional sites, location and copy number restrictions on functional sites, and space constraints between different functional sites. Depending on the specification of these constraints, turnover events are thus possible, as functional sites are not generally tied to a specific location in the sequence.
TFBS replacement turnover rate estimation
where K is the number of different ancestral sequences, M i is the number of all descendent sequences of the ith ancestral sequence, and m i is the number of descendent sequences in which the TFBSs of interest have been subjected to replacement turnover. We also report the median values, as the distributions of RTRs are not necessarily approximate to the normal distribution.
Using PSPE for sequence evolution simulation, we are able to study the replacement turnover rate of functional conserved TFBSs in the evolution process of promoter sequences. In a complicated evolution process, many different events can occur at a TFBS, including point mutation, deletion, insertion, translocation, duplication and replacement. Our study here focuses only on TFBS replacement turnover in a simple 'status quo' scenario, assuming that all TFBSs in the sequences are essential to maintain proper gene expression levels and are thus functionally conserved in all descendent sequences. All functionally conserved TFBS are, however, allowed to be translocated to neighboring regions or replaced by newly evolved sites within a given restricted space. As ancestral sequences, we use either real or simulated human promoter sequences.
PSPE parameters for simulating sequence evolution
Original ancestral sequences
Human non-coding region
A = 0.215, C = 0.287, G = 0.285, T = 0.214
Point substitution:InDel ratio
Geometric distribution (p = 0.5)
Heterogeneity of substitution rate
Gamma (1.0) + Iota (0.1)
Range of GC content
Evolution distance per step
0.05 substitution per site
Evolution of individual binding sites
Therefore, the probability of at least one replacement turnover, or expected RTR, of a TFBS in a time interval t is:
RTR = Pr(N ≥ 1) = 1 - Pr(N = 0) = 1 - e-λt
which corresponds to the cumulative density function of an exponential distribution with mean 1/λ.
We fitted the observed E2F RTR data with this exponential model and estimated the model parameter λ. This simple exponential model fitted well with the RTR of E2F observed in our simulation (Figure 4a), where the model parameter λ was 0.0832 and 0.0724 for fitting the mean and median of the observed RTR, respectively. In other words, the average probability for a replacement turnover event of an E2F binding site was 8.3% at a divergence distance of one substitution per site, suggesting the potential of substantial E2F turnover.
To verify the RTR of E2F estimated on simulated promoter sequences, we repeated the experiment using real promoter sequences of human genes as ancestral sequences, known to be under E2F regulation from wet-lab experiments [41, 42]. Among 127 E2F regulated genes confirmed by ChIP-chip experiments , we were able to select 11 genes, each having one and only one E2F binding site in the upstream region of 500 base pairs (bp) from its transcription start site (see Materials and methods; see Additional data file 1 for details of the 11 genes). Most of the 11 genes are well known to be under regulation of E2F, especially CDC6, for which the location of the E2F binding site and functional activity of E2F have been characterized [43–45]. Real promoter sequences would presumably give us a more realistic estimate of RTR of E2F sites than starting from simulated background sequences. One such potential difference is that real promoter sequences may contain remnants or 'ghosts' of previously functional binding sites accumulated during evolution, which could become functional again by a small number of sequence changes, which would thus result in higher turnover rates.
Estimated exponential rates associated with replacement turnovers of different TFBSs
To validate the good fit of estimated turnover rates with a simple exponential model, we performed similar independent simulation studies for the additional TFBSs of Myc and NFκB. Both Myc and NFκB have palindromic binding sites with a length of 11 and 10 bases, respectively. Myc sites have more conserved positions in the center region, consisting of mixed A/T and G/C nucleotides, whereas NFκB has highly conserved positions at the two sides, consisting of mostly G/C nucleotides (Figure 3). Overall, Myc sites are the most degenerate among the three TFBSs. These differences in information content and sequence composition may lead to different RTRs. It was instructive to see how these factors affected the RTR, and whether the exponential model provided as good a fit for these other TFBS as well. For each TFBS, we again simulated 1,000 ancestral promoter sequences, and for each ancestral promoter sequence, we simulated 1,000 descendent sequences at each of 15 divergence distances as above. We also used the same substitution and InDel models for the sequence evolution (Table 1). For the purpose of comparison, we imposed the same location and copy number constraints on both TFBSs as specified in Figure 3.
Turnover rates of regulatory modules: the Myc-E2F pair
Functional constraints placed on a Myc-E2F pair in promoter sequences
E2F location relative to TSS
Myc location relative to TSS
Copy number of E2F
Copy number of Myc
DNA strand of E2F site
DNA strand of Myc site
Additional space constraint between Myc and E2F sites
We fitted the observed RTR data with both models 3 and 4. Both models fitted well with data as shown in Figure 6a,b,d, validating our assumption for the independent evolution of TFBSs. However, as the RTRs for the Myc-E2F pair in Figure 6c show, the simple models began to deviate from the simulations in more complex scenarios including dependencies between sites.
TFBS conservation between human and mouse
Because of the moderate divergence distance between mammalian genomes, such as those of human and mouse, there is a strong interest in comparative studies of their genomes as an important way to infer gene function and gene regulation as well as their evolutionary mechanisms. While it is relatively easy to compare the coding sequences of human and mouse orthologous genes, it remains a difficult task to compare their promoter sequences, largely because they are more divergent than coding sequences. One pioneering comparative genomics study estimated that a fraction as high as 32-40% of the human functional TFBSs may not be functional in rodents, suggesting a high turnover rate of TFBSs . A recent study estimated that the divergence distances of human and mouse from the last common ancestor are 0.1187 and 0.3987 substitutions per site, respectively . Another study estimated the total divergence distance of human and mouse at about 0.8 substitutions per site . Based on these two estimates, we here set the divergence distances of human and mouse from their last common ancestor to be 0.2 and 0.6, respectively, in terms of the number of substitutions per site in neutrally evolving regions. In this study, we simulated TFBS evolution of human and mouse from their last common ancestral species in the hope of shedding some light on the evolution of their TFBSs. Using the same three TFBSs as above, we estimated RTRs of individual TFBSs in human and mouse orthologous sequences at different InDel rates as well as at different restricted translocation distances.
Effect of InDel rate variation
We again simulated 1,000 ancestral promoter sequences and evolved 1,000 pairs of human and mouse descendent sequences from each ancestral sequence, but this time varying the ratio of InDel to substitution rate from 0 (that is, no InDels at all) to 0.2 (one InDel per five substitution events) at ten different steps. Except for the InDel rate, we used the same models and parameters as given in Table 1. We performed three independent simulations for the TFBSs of E2F, Myc and NFκB. The evolution of individual TFBSs was under the same functional constraints as above (Figure 3).
Rate = -a + b × e1.5γ
Estimated parameter values for the exponential model of RTR and InDel rate
Model for mean
Model for median
Influence of restricted translocation distance
TFBS often have a preferred location relative to the TSS, but many TFBSs can move within a limited distance while maintaining their regulatory function. Such a restricted translocation distance relative to the TSS may have an important impact on TFBS evolution. In a final simulation, we studied how the RTR of a TFBS between human and mouse was affected by its restricted translocation distance.
where a, c1, c2 and c are model parameters, c is the product of c1 and c2, and θ is the restriction translocation distance of a TFBS. In this model, c1 and c2 are associated with the evolutionary distances of species one and two from their last common ancestral species. Therefore, the TFBS RTR in a single species is a linear function of the square root of its restricted translocation distance. Interestingly, while the median RTRs for E2F could also be fitted quite well with this model (Figure 6a), the fit for Myc and NFκB was less good, hinting at the strong effects that different motifs can have on some of the promoter features studied here.
Impact of transition/transversion ratio
To better simulate sequences of closely related species, which generally have a higher ratio between transition and transversion substitution rates than distantly related species, we used a relatively large ratio of transition to transversion (20:1) in all the above simulations. This large ratio made sense in our case, as we simulated sequence evolution in a stepwise fashion with a small divergence distance (0.05 substitutions per site) at each step. To check whether a large change in transition to transversion ratio would have significant impact on RTRs, we also ran all the above simulations at a much smaller ratio of 4:1. We used the Wilcoxon rank sum test to check whether the difference between the means of the resulting RTRs was significantly different from zero (data not shown). We found no statistically significant differences in our results (Bonferroni-corrected significance level of P ≤ 0.05). The results suggested that our observed replacement turnovers were slow processes relative to nucleotide substitutions.
Evaluation of alignment tools
In addition to the theoretical studies regarding turnover rates, the PSPE simulator can be used to assess the impact of the turnover phenomenon on practical applications in comparative genomics. In the following, we looked specifically at the problem of identifying functional binding sites in multiple sequence alignments. Most current alignment tools are based on the assumption that the functional sites in orthologous sequences are homologous in sequence space, that is, that they can be traced back to the same position in the ancestral genome. Replacement turnover events of functional sites in promoter sequences, however, make this assumption somewhat unrealistic, which could consequently limit the performance of a tool for aligning non-coding sequences. Our evaluation aimed to: compare different multiple sequence alignment tools for their robustness to violation of this assumption; and investigate the impact of increasing the number of species on tool performance.
Simulation parameters used by PSPE for generating benchmark promoter sequences
Evolution distance per step
0.05 substitution per site
Length of root sequences
Background sequence model
Markov order of third
A = 0.258, C = 0.242, G = 0.242, T = 0.258
Gamma (1.0) + Iota (0.1)
Range of GC content
Negative binomial distribution (1, 0.5)
Functional TFBS constraints used in the promoter simulation
Location (min, max)
Copy no. (min, max)
We used this dataset to assess the performance of five widely used MSA tools: CLUSTALW , DIALIGN , AVID/MAVID [19, 51], LAGAN/MLAGAN , and MUSCLE . Among the five tools, AVID/MAVID is the fastest alignment tool and uses exactly matching words as alignment seeds to speed up the alignment process, albeit at the expense of lower alignment accuracy. As an improvement, both DIALIGN and LAGAN/MLAGAN adopt non-exact word matching for finding alignment seeds, which can improve their ability to detect degenerate functional sites. DIALIGN identifies alignment seeds by finding consistent sequence segments of a fixed length between sequences, while LAGAN/MLAGAN locates alignment seeds by chaining together neighboring similar words. Both CLUSTALW and MUSCLE are primarily based on the dynamic programming algorithm. MUSCLE, however, has made significant improvements over CLUSTALW by employing anchoring techniques and a progressive refinement approach. The performance was measured as TFBS detection accuracy, defined as the proportion of nucleotides in functionally homologous TFBSs that were correctly aligned. The detection accuracy reported here is the average value over 1,000 replicates at each divergence scale level.
In the process of evolution, selection may act directly on regulatory functions but only indirectly on gene sequences, which is supported by the experimental observations that some orthologous genes with highly conserved expression patterns have substantial divergence in their promoter sequence [7–9]. That means that functional conservation does not necessitate conservation on the sequence level. Neutral sites in promoter sequences may be free to change, and newly evolved functional sites can readily replace old ones. It is important, therefore, to understand the evolutionary mechanisms of regulatory regions in order to improve computational methods that are developed to analyze them. However, it is difficult to investigate systematically non-protein-coding evolution on real sequence data because the history of evolutionary events shaping them is largely unknown, and the map of functional sites in regulatory sequences is often incomplete and inaccurate. In many cases, there is no simple way to distinguish a site newly evolved in a replacement turnover event from one created by simple translocation of an old site. Computational simulation seems to be an effective alternative to study TFBS evolution in this case. Simulators allow us to investigate evolutionary events such as replacement turnovers of TFBS, which may significantly limit the effectiveness of phylogenetic footprinting for regulatory region identification, in an explicit way. Here, we describe a new sequence simulator to investigate the effect of different functional constraints on turnover rates, and to create a framework to evaluate multiple sequence alignment algorithms regarding their ability to detect functional elements in the presence of turnover events.
Simulation of TFBS turnover
Our simulator PSPE is designed specifically for studying the evolution of functional sites in regulatory sequences. PSPE is not only able to use one of many common models of nucleotide substitution, but it can also apply different InDel models important for regulatory sequence simulation. In contrast to other simulators, PSPE imposes a variety of functional constraints instead of sequence constraints. Such functional constraints include GC content, presence of functional sites, strength of the binding sites, location and copy number restrictions on functional sites, and space constraints between different functional sites. All these features enable PSPE to simulate evolution of promoter sequences more realistically than other simulation programs.
Consistent with previous simulation studies [5, 14], our results show that TFBS turnover can occur rapidly in promoter evolution. For example, replacement turnover events can occur at a Poisson rate as high as 0.083 for the highly constrained E2F sites even if we only allow for a small translocation distance of 50 nucleotides, and is even higher for the less constrained sites of Myc (0.22) and NFκB (0.103). Furthermore, these parameters may be relatively conservative considering that we used stringent matrix score cutoffs to avoid false hits, highly restricted locations for functional sites, a relatively low rate for transversions, and the requirement of the presence of exactly one functional site throughout. However, a high turnover rate of TFBSs can frequently be detrimental to an organism, and highly increased turnover rates may not be observed in practice, even for degenerate sites. This is supported by an additional simulation study we carried out using a lower cutoff threshold of 0.85 for functional sites, in which promoters with Myc sites had a lower RTR despite the higher chance of creating a new site at the lower cutoff. This was mainly due to our restriction of allowing only one site to be present in the promoters (see Additional data file 1 for details). Therefore, TFBS replacement turnovers in real sequences may happen more frequently than we estimated, but there is an upper limit of turnover rate for each individual TFBS imposed by the resulting changes in fitness.
Altogether, our study suggests that the TFBS RTR of a functional site between different species does not depend only on the base composition of the site and the divergence distances between species, but also on location constraints, neighboring functional sites, the InDel rate, and the GC content. While not discussed in detail, a simulation using lower GC contents showed a consistently higher or lower RTR depending on the TFBS, suggesting that the high GC content in promoter regions near the TSS is affecting the turnover rates of important functional sites (Additional data file 1). Consequently, the RTR varies not only among different functional sites and different species, but also among different instances of the same functional site upstream of different genes.
While we attempted to choose realistic model parameters and biologically meaningful functional constraints in our simulations, our estimates are certainly biased by the assumptions behind the chosen constraints, and may be substantially different from the real ones. Furthermore, the TFBS and evolution models themselves represent simplified versions of the underlying biological processes, and other factors, such as the number of replicates used in the simulation, can add some additional variation as well. We realize that the weight matrices used here as models of functional sites may not be as adequate for modeling positional dependencies as other more advanced motif models [52, 53]; however, PWMs are a valid model for many biological motifs, are available in open-access databases, and are computationally more efficient than other advanced models. Computational efficiency is an important factor in simulation studies that are as large as this one.
Simple evolutionary changes within regulatory regions, such as turnover events affecting individual sites only, can be modeled effectively by Poisson events. We could show good agreement of this for a variety of binding sites and conditions, such as different translocation distances. In theory, one could derive closed-form solutions for the probability of these events, based on the sequence composition of the region and the composition and degeneracy of a binding site. However, with an increasing number of restrictions and dependencies of sites in complex regulatory modules, this becomes increasingly cumbersome and not straightforward. Figure 6c showed that these simple models begin to deviate as soon as we address the conservation on the module level instead of individual sites only.
One can easily think of a large number of additional parameters and configurations of functional sites that we did not explore. A tool such as PSPE will allow researchers to explore empirically a wider range of restrictions and complex configurations of regulatory regions in an efficient manner. Enhancers come in many different flavors, from highly restricted 'enhanceosomes' corresponding to ultra-conserved elements, to highly flexible 'billboard' enhancers allowing for many drastic sequence changes without apparent functional consequence . PSPE is available to the public and we anticipate that it will be a beneficial tool for evolutionary biologists to explore the specific characteristics and evolutionary space of particular regulatory systems. Future extensions may include an adaptation for RNA regulatory regions, including specific modeling of compensatory mutations in RNA secondary structure, incorporating transposable elements, and neighbor-dependent substitution models.
Assessment of MSA tools
During evolution, natural selection forces impose different functional constraints on protein coding and regulatory regions. The phenomenon of frequent TFBS turnovers in regulatory regions may partially explain why comparative genomics analysis, the most powerful approach so far, has met with only limited success in identifying functional sites despite the increasing availability of whole genome sequences. TFBS turnovers may also be responsible for the weak relationship between sequence conservation and functional conservation in promoter sequences, which makes the straightforward tracing of nucleotide evolution between divergent orthologous sequences meaningless with respect to their function. Our strategy of defining conservation on the level of functional constraints such as matrix score cutoffs is similar to a recent model, which defines conservation on the level of conserved binding energy . In this sense, functional homology maps of regulatory regions, where mapped elements correspond to functionally equivalent sites, can be more important than strict sequence homology.
While many alignment tools have been developed so far, it is difficult to systematically evaluate and compare these tools, especially regarding their performance in aligning non-coding sequences, for which we have a limited understanding of evolutionary constraints. Studies that rigorously assess alignment tools (for example, ) can serve as useful reference for making more informed decisions about which tool to use for which task, and can also provide important insights or suggestions for improvement of existing algorithms. Most published evaluations of alignment algorithms were based on alignment sensitivity, specificity, and accuracy, and did not address replacement turnover of functional sites in evolution. The evaluation reported here is different: instead of trying to systematically assess all different performance aspects, we focus on one particular scenario, the capability of accurately aligning conserved TFBS in promoter sequences. Specifically, our evaluation was based on two aspects: the capability of aligning functionally homologous TFBS in promoter sequences in which TFBS replacement turnovers are allowed to occur; and the capability of increasing TFBS detection power with an increase in the number of homologous aligned species.
The five tools selected for our evaluation are representatives of many existing tools of different underlying algorithms. These differences were clearly reflected in the success of aligning TFBSs, which ranked MUSCLE at the top, AVID/MAVID at the bottom, and others in between. We purposefully chose transcription factors with long binding sites, and required strong conservation of orthologous sites (that is, a high matrix score threshold for each site). Furthermore, while our choice of constraints allowed for turnover events, it did not allow for a shuffling of sites, which none of the programs can take into account. Yet, our results suggest that the ability of existing tools to detect functionally homologous elements decreases with increasing replacement turnover rates of functional sites or, related, the sequence divergence distance. An increased divergence of the non-functional parts of the sequence does thus not necessarily help to locate individual functional binding sites, even if the sites are highly conserved and 15-20 bp long.
It is often reported that an increase in the number of species may significantly increase the power for functional site identification in comparative genomics analyses . On the contrary, our evaluation results show that we should be extremely cautious at this point to assume that this is a general property of many functional DNA regions and/or tools to analyze them. With the exception of DIALIGN, the TFBS detection accuracy of all tools was either decreased or relatively unchanged in most cases. This is in fact not surprising when ones take a closer look at the approaches used for multiple sequence alignment. CLUSTALW, MAVID, and MLAGAN all use the same progressive approach for aligning multiple sequences, in which intermediate alignments from the early stages are not allowed to change in later stages. That means that the mistakes that happen in an early stage of alignment will be propagated and cannot be corrected at a later stage. Since a tool based on the progressive approach can only accumulate more mistakes when aligning more sequences, it is conceivable that its performance decreases as the number of species increases. MUSCLE employs an improved progressive approach that allows changes in the alignment of sub-groups in a recursive refinement process, which explains why MUSCLE did not show a significant decrease in performance as the number of species increased. It is conceivable, however, that the particular choice of species, and the order in which they are presented to a phylogenetic aligner, may significantly change the accuracy of these approaches. DIALIGN is the only tool surveyed here that does not use the progressive approach. Instead, it assembles the whole alignment by greedily finding all consistent segments of significant similarity from all sequences , which allows DIALIGN to be able to take advantage of the information from additional species. While these features of DIALIGN are interesting, there is still much room for improvement as its overall performance is no better than MUSCLE.
We want to stress that the tools in this study were not specifically developed for the alignment of non-coding regions. In fact, some design principles may be counterproductive for this task: whole genome alignments are built to provide fast comparative maps and are certainly able to detect coding conservation. The progressive aligners in our evaluation are meant to provide the phylogenetic history, that is, to compute an accurate alignment of bases that are derived from the same nucleotide in the ancestral genome. Yet, there is no doubt that many researchers currently use these tools in studies concerning gene regulatory sequences, and we hope that this evaluation provides clues about what to expect if they are used in this way. We aimed to include a representative subset of tools fast enough to perform extensive comparisons. We do not expect this to have introduced a systematic bias, but of course some recently developed aligners (for example, TBA , Prank , or Probcons ) may perform differently to our selected set.
The objective and systematic evaluation of alignment tools is a challenging task, in particular for an assessment on non-coding sequences whose actual functional and evolutionary mechanisms remain largely unknown. Since we simulated data under a set of specific conditions that are unlikely to represent all actual scenarios, one should carefully interpret our comparison results. For example, our study did not consider the ability of a tool to deal with very large insertions and deletions because of few large insertions/deletions in our simulated data. Furthermore, we were very conservative in our constraints, and, for example, allowed for turnover, but not for a shuffling of sites. Our results are therefore a rather optimistic estimate, and performance on real promoters with shorter sites that do not preserve their order can be expected to be significantly worse. We are also aware that the criteria for tool performance can be different in a different study depending on its objectives. Therefore, our results may not be applicable for some studies, such as the estimation of divergence distances between species. For such cases, the recent evaluation by Pollard et al.  may be a better reference.
TFBS replacement turnover is an important phenomenon in the process of promoter evolution, and providing a framework to address it systematically is critical for our understanding of the mechanisms driving promoter evolution. We introduced the new simulation system PSPE, designed specifically for regulatory sequences, and allowing for functional site turnover events. PSPE is freely available at the authors' websites [61, 62]. Applying PSPE in a large-scale simulation, we found that replacement turnovers could happen rapidly in promoter evolution. We also investigated different factors besides the divergence distance that significantly affect turnover rates, and describe the relationships between the RTR and different factors in simple mathematical models. Our study adds to the increasing evidence that it is important and advantageous to trace homology on the functional rather than on the sequence base-pair level in cross-species comparisons of regulatory sequences.
PSPE also provides a flexible system to generate appropriate standard test sets for alignment or motif finding algorithms, and we presented first results of this application. To our knowledge, our evaluation of MSA tools is the first one to assess their ability to detect TFBSs that are homologous on a functional level. Our evaluation of five widely used MSA tools suggests that the turnover of functional sites poses a challenge for alignment tools, even for the simplified case where the functional sites remain co-linear in orthologous sequences. While all MSA tools under consideration, especially MUSCLE, performed well in aligning functional sites at short or moderate divergence distances, they appeared to lack sufficient capability to align functional sites that have high RTRs in divergent sequences. In addition, our study suggests that the widely used progressive approach for MSA is counterproductive for the multiple alignments of homologous non-coding sequences, and that MUSCLE's improved progressive approach and DIALIGN's segment assembling approach are better suited for non-coding MSA. Some recent approaches are promising to successfully deal with the specific challenges of non-coding alignments, for example, by using available models of TFBS to 'anchor' alignments . However, this still leaves us with a number of open issues on the way towards computational tools that will help us to elucidate the structure and evolution of regulatory regions.
Materials and methods
Background model of ancestral sequences
To generate biologically relevant ancestral sequences, we used a 3rd order Markov model to generate background sequences of ancestral promoters. We trained the background Markov model on a large real dataset of regulatory sequences extracted from the NCBI human RefSeq database (build 35). The dataset consists of 25,088 human promoter sequences each spanning a region of 500 bp immediately upstream of the transcription start sites. The base frequencies of four nucleotides were also estimated on this dataset.
Selection of E2F regulated ancestral genes
We obtained 127 experimentally confirmed E2F regulated genes from a previous publication . We removed the genes for which we were not able to extract their promoter sequences from NCBI, and extracted 500 bp long promoter sequences upstream of their annotated transcription start site. We then identified potential E2F sites in each promoter sequence using the PWM model. We removed those genes that had either zero or more than one E2F binding site based on the cutoff score of 0.92 given in Figure 3. The remaining 11 genes are given in Additional data file 1 and were used as ancestral sequences for our simulation study.
Motif model of TFBSs
The function gives a normalized score from 0 to 1 for any TFBS, with 0 for the most unlikely and 1 for the most likely site. A functional binding site defined in this study is a site having a score larger than a certain cutoff threshold.
Position weight matrices of E2F, Myc, and NFκB were taken from the JASPAR database [39, 65]. The functional sites in simulated ancestral sequences were generated from these PWMs. To maintain a low false positive rate of binding sites, we chose a relative strict cutoff score for each TFBS. At a cutoff score of 0.92, we estimated the fraction of false positive predictions to be less than 5% of the total number of planted sites on coding regions of human genes in the RefSeq database.
DNA substitution models
where μ ij is the instantaneous substitution rate of nucleotide i by j, π i is the frequency of nucleotide i, and where i = a, c, t, g; j = a, c, t, g. The other substitution models above can be expressed as special cases of the GTR model. From the Q matrix, we can obtain the matrix of nucleotide transition probabilities in continuous time by:
P(t) = e kQrt
where ι is the proportion of invariant rates and α is the shape parameter of the gamma distribution.
PSPE is based on Dawg , an earlier sequence evolution simulation system, and in particular adopted its range of InDel formation model. The model is based on a Poisson process that assumes InDel formation to happen at a fixed, instantaneous rate at any site. The model treats insertions and deletions as two separate processes. Under the model, the time intervals between two insertions and those between two deletions follow exponential distributions with means [λIns (L + 1)]-1 and [λDel (L + u - 1)]-1, respectively, where L is the sequence length, u is the mean length of deletion segments, and λIns and λDel are Poisson rates of insertion and deletion, respectively.
PSPE models InDel length by one of three commonly used distributions: geometric, negative binomial, and Zipf's law distributions. We used the simple geometric distribution for InDel length in this study.
Simulation of sequence evolution
To address TFBS turnover at different distances, we simulated sequence evolution at each of 15 different divergence distances: 0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 4.0, and 5.0, measured in the number of substitutions per site. These distances should cover the divergence between most currently sequenced genomes used in comparative genomics studies. To study the effect of different InDel rates on TFBS conservation between human and mouse, we performed simulations at ten different InDel rates measured by the number of InDels per substitution: 0, 0.01, 0.025, 0.05, 0.075, 0.1, 0.125, 0.15, 0.175, 0.2, with, for example, 0 meaning no InDel, and 0.2 meaning one InDel every five substitutions. For studying the effects of restricted translocation distances on the RTRs, we compared RTRs of each TFBS at 12 different maximal translocation distances: 0, 5, 10, 20, 30, 40, 50, 75, 100, 150, 200, and 300 bases.
Model fitting method
To estimate parameters of our RTR models, we did a non-linear least-squares regression analysis on the observed turnover data from simulation. We used the Gauss-Newton algorithm for the non-linear least-squares fitting, which minimizes the sum-of-squares error. We performed this analysis in R.
We selected five alignment tools, CLUSTALW, DIALIGN, AVID/MAVID, LAGAN/MLAGAN, and MUSCLE, to assess their capability of detecting functional sites. The criteria for our selection were: either widely used or shown good performance in other studies for aligning DNA sequences; capable of aligning multiple sequences in a reasonable amount of time; free availability and easy installation in the Linux operating system; and strict co-linearity for global alignment. Many excellent alignment tools were not evaluated because: they did not meet one of the above criteria; their algorithms were similar to one of five tools we selected; and/or they did not show a significantly different performance in other evaluations. For example, ACANA , BLASTZ , MUMmer , and SSEARCH  were not selected because of the first criterion, and T-COFFEE , POA  and MAFFT  were not evaluated due to the second criterion. Our study is therefore not considered to be a systematic evaluation of all available good alignment tools and rather as a representative but somewhat subjective cross-section. Each of the five tools is briefly described below.
CLUSTALW (v1.83) is one of the best-known MSA tools and is based on a progressive method, which first aligns the most similar sequences and then successively adds more distant sequences or groups to the alignment until all sequences are aligned together. CLUSTALW employs the Needleman-Wunsch pair-wise alignment algorithm for calculating similarities between sequences and constructing a phylogenetic guide tree. We ran CLUSTALW with its default settings.
DIALIGN (v2.2.1) is an anchor-based MSA tool for both DNA and protein sequences. Different from other alignment tools, DIALIGN assembles alignments by finding consistent segments exhibiting statistically significant similarity, and does not align regions showing no significant similarity. Therefore, strictly speaking, DIALIGN is a local alignment tool that produces a full global alignment only for sequences with high similarity. Recent studies [21, 28, 79] suggested that DIALIGN performed well in aligning sequences of low similarity with long insertions and deletions. DIALIGN was run with default parameters.
AVID (v2.1) is a pair-wise global alignment tool that is capable of aligning very large genomics sequences. Its employs an anchor-based approach and uses a suffix tree algorithm for identifying potential anchoring regions between sequences. MAVID (v2.0.4) is a progressive MSA tool and is the direct extension of AVID. To speed up the alignment process, MAVID does not directly align two intermediate alignments or groups; instead, it first infers the common ancestral sequence of each alignment by maximum likelihood, and then uses AVID to align two ancestral sequences. We used AVID for two-species alignments, and MAVID for three or more species. CLUSTALW at default settings was used as the tree building tool for MAVID.
LAGAN/MLAGAN (v1.21) is a suite of programs for aligning DNA sequences. As an anchor-based pair-wise alignment tool, LAGAN first identifies anchoring regions by chaining similar neighboring words found in a local alignment process, and subsequently aligns other regions by the Needleman-Wunsch algorithm to form a global alignment. MLAGAN is a progressive multiple sequence alignment tool based on LAGAN pair-wise alignments. Similar to CLUSTALW in scoring multiple sequence alignments, it uses the sum-of-pairs approach for scoring substitutions and a consensus-based method for scoring gaps. Since MLAGAN does not build a phylogenetic tree, which is a required input, we provided it with the phylogenetic tree from our simulation. In this study, LAGAN was used for two-species alignments and MLAGAN for three or more species. Both tools were run with default settings.
MUSCLE (v3.6) is a relatively new MSA tool for both DNA and protein sequences based on an improved progressive approach. Like CLUSTALW and T-COFFEE, MUSCLE is based on a progressive approach, and uses the sum-of-pairs scoring scheme for multiple alignments, but it differs from other progressive tools by allowing changes of both the phylogenetic tree and the alignment in intermediate steps in an iterative refining process. MUSCLE was shown to perform better than T-COFFEE and POA in aligning benchmark protein sequences [20, 81]. Because MUSCLE is quite slow when running on default settings, we used its diags option for anchoring alignment and maxiters option to limit the number of refinement iterations to two.
Simulation of alignment benchmark data
The benchmark sequence datasets were simulated by the PSPE system. PSPE sequence simulation can be generally divided into two separate steps. In the first step, PSPE generates ancestral promoter sequences with different functional TFBSs (see Table 6 for the sites we used in this study). With the exception of YY1E2F, a composite of YY1 and E2F and chosen on purpose for this study, the TFBS sites used here were arbitrarily selected among those satisfying three criteria: binding site of a human transcription factor; PWM available in the JASPAR database; and length between 12 and 25 bp. A motif instance, which is generated randomly from a PWM, is defined as a functional site if its score is larger than a certain cutoff value. To avoid degenerate motifs, we used a relatively stringent cutoff of 0.90 for all TFBSs. PSPE first generates functional sites and a map of their locations, and then fills the remaining region in promoters with background sequences. To generate biologically relevant background sequences, we estimated parameter values of a 3rd order Markov chain from a large real sequence data set and used them to simulate the ancestral background sequences. The actual sequence data consisted of human promoter sequences of 24,649 transcripts extracted from the NCBI human RefSeq database (build 35). Each sequence in the training data comprises a 5,000 bp region upstream of the putative transcription start site.
In the second step, PSPE simulates descendent sequences from the simulated ancestral promoter sequences according to specified evolutionary models and functional constraints (Table 5). For the evaluation on the mammalian tree of five species, we scaled the tree, relative to evolutionary distance, at the following eight levels: 0.25, 0.5, 1, 1.5, 2, 3, 4, and 5, which we refer to as divergence scale coefficient. At each scale level, PSPE generated 1,000 sets, each consisting of five descendent mammalian sequences from their common ancestral sequence of length 3,000 bp.
We use the expressions 'a tool alignment' to refer to an alignment produced by an alignment tool, 'a simulated alignment' to refer to a correct alignment of homologous base pairs from the simulation, and 'a simulated TFBS map' to refer to a correct alignment of TFBSs homologous on the functional level. Both 'a simulated alignment' and 'a simulated TFBS map' were generated by the PSPE simulator. The ability of an alignment tool to detect functional sites was assessed by TFBS detection accuracy. Tool performance was also assessed by four additional measures: overall alignment sensitivity, overall alignment specificity, TFBS sensitivity and TFBS specificity, which are similar to measures in . Definitions of these measures are given below.
where n is the number of sequences in the alignment, n(n-1)/2 is the number of different sequence pairs, and L is the total length of functional sites (for a DA of an individual functional site, L is the length of the site). a ij has a value of 1 if the bases at the jth position of functionally homologous binding sites in a sequence pair i are aligned to each other and 0 otherwise. Since all our simulated sequences contain the same set of TFBSs, L is the same for all alignments.
where k i is the total number of positions for which bases are aligned to bases in the ith pair-wise alignment from a simulated alignment. c ij has a value of 1 if the jth position was aligned correctly in any given pair of sequences i in the tool alignment, and 0 otherwise.
where g i is the total number of positions, where bases in one sequence aligned to gaps in the other sequence in the ith pair-wise alignment of a simulated alignment; c ij has a value of 1 if the jth position aligned correctly, and 0 otherwise.
where l i is the total number of positions in functional sites of ancestral sequences, where bases in one sequence aligned to bases from the other sequence in the ith pair-wise alignment of a simulated alignment; c ij has a value of 1 if the jth position aligned correctly, and 0 otherwise. In case of no replacement turnovers of TFBSs, TFBS sensitivity is equal to TFBS detection accuracy.
where L is total length of all functional sites in the ancestral sequence.
In this paper, we mostly use detection accuracy and TFBS sensitivity. The former refers to the fraction of functional sites in the 'descendants' aligned to each other, the latter to the fraction of sites corresponding to the location of the TFBS in the 'ancestral' sequence that are correctly aligned to each other.
Additional data files
The following additional data are available with the online version of this paper. Additional data file 1 provides additional results of turnover simulations varying the binding site strength and GC content of the background sequences, and information on the E2F promoter data set. Additional data file 2 provides additional evaluations of alignment algorithms on sequence sets simulated with a phylogenetic tree with a star topology.
general time reversible
multiple sequence alignment
Phylogenetic Simulation of Promoter Evolution
position weight matrix
replacement turnover rate
transcription factor binding site
transcription start site.
The authors wish to thank Dr Greg Wray at Duke University for many beneficial discussions. UO is an Alfred P Sloan fellow in Computational and Evolutionary Molecular Biology and is funded by NIH grant 1-R01-HG004065. JRN is funded by grants from the NIH, 1-RO1-CA104663-05, 1-RO1-CA106520-03 and 5-U54-CA112952-03. This research was also supported in part by the National Science Foundation under grant NSF PHY05-51164 through a program participation of UO at the Kavli Institute for Theoretical Physics.
- Loots GG, Ovcharenko I, Pachter L, Dubchak I, Rubin EM: rVista for comparative sequence-based discovery of functional transcription factor binding sites. Genome Res. 2002, 12: 832-839. 10.1101/gr.225502. Article published online before print in April 2002.PubMedPubMed CentralView ArticleGoogle Scholar
- Lenhard B, Sandelin A, Mendoza L, Engstrom P, Jareborg N, Wasserman WW: Identification of conserved regulatory elements by comparative genome analysis. J Biol. 2003, 2: 13-10.1186/1475-4924-2-13.PubMedPubMed CentralView ArticleGoogle Scholar
- Ludwig MZ: Functional evolution of noncoding DNA. Curr Opin Genet Dev. 2002, 12: 634-639. 10.1016/S0959-437X(02)00355-6.PubMedView ArticleGoogle Scholar
- Maduro M, Pilgrim D: Conservation of function and expression of unc-119 from two Caenorhabditis species despite divergence of non-coding DNA. Gene. 1996, 183: 77-85. 10.1016/S0378-1119(96)00491-X.PubMedView ArticleGoogle Scholar
- Stone JR, Wray GA: Rapid evolution of cis-regulatory sequences via local point mutations. Mol Biol Evol. 2001, 18: 1764-1770.PubMedView ArticleGoogle Scholar
- Dermitzakis ET, Clark AG: Evolution of transcription factor binding sites in Mammalian gene regulatory regions: conservation and turnover. Mol Biol Evol. 2002, 19: 1114-1121.PubMedView ArticleGoogle Scholar
- Ludwig MZ, Bergman C, Patel NH, Kreitman M: Evidence for stabilizing selection in a eukaryotic enhancer element. Nature. 2000, 403: 564-567. 10.1038/35000615.PubMedView ArticleGoogle Scholar
- Ludwig MZ, Palsson A, Alekseeva E, Bergman CM, Nathan J, Kreitman M: Functional evolution of a cis-regulatory module. PLoS Biol. 2005, 3: e93-10.1371/journal.pbio.0030093.PubMedPubMed CentralView ArticleGoogle Scholar
- Ludwig MZ, Patel NH, Kreitman M: Functional analysis of eve stripe 2 enhancer evolution in Drosophila: rules governing conservation and change. Development. 1998, 125: 949-958.PubMedGoogle Scholar
- Frith MC, Ponjavic J, Fredman D, Kai C, Kawai J, Carninci P, Hayashizaki Y, Sandelin A: Evolutionary turnover of mammalian transcription start sites. Genome Res. 2006, 16: 713-722. 10.1101/gr.5031006.PubMedPubMed CentralView ArticleGoogle Scholar
- Carter AJ, Wagner GP: Evolution of functionally conserved enhancers can be accelerated in large populations: a population-genetic model. Proc Biol Sci. 2002, 269: 953-960. 10.1098/rspb.2002.1968.PubMedPubMed CentralView ArticleGoogle Scholar
- Gerland U, Hwa T: On the selection and evolution of regulatory DNA motifs. J Mol Evol. 2002, 55: 386-400. 10.1007/s00239-002-2335-z.PubMedView ArticleGoogle Scholar
- Wray GA, Hahn MW, Abouheif E, Balhoff JP, Pizer M, Rockman MV, Romano LA: The evolution of transcriptional regulation in eukaryotes. Mol Biol Evol. 2003, 20: 1377-1419. 10.1093/molbev/msg140.PubMedView ArticleGoogle Scholar
- MacArthur S, Brookfield JFY: Expected rates and modes of evolution of enhancer sequences. Mol Biol Evol. 2004, 21: 1064-1073. 10.1093/molbev/msh105.PubMedView ArticleGoogle Scholar
- Lee AP, Koh EG, Tay A, Brenner S, Venkatesh B: Highly conserved syntenic blocks at the vertebrate Hox loci and conserved regulatory elements within and outside Hox gene clusters. Proc Natl Acad Sci USA. 2006, 103: 6994-6999. 10.1073/pnas.0601492103.PubMedPubMed CentralView ArticleGoogle Scholar
- Santini S, Boore JL, Meyer A: Evolutionary conservation of regulatory elements in vertebrate Hox gene clusters. Genome Res. 2003, 13: 1111-1122. 10.1101/gr.700503.PubMedPubMed CentralView ArticleGoogle Scholar
- Ruvinsky I, Ruvkun G: Functional tests of enhancer conservation between distantly related species. Development. 2003, 130: 5133-5142. 10.1242/dev.00711.PubMedView ArticleGoogle Scholar
- Davidson EH: Genomic Regulatory Systems: Development and Evolution. 2001, San Diego, CA: Academic PressGoogle Scholar
- Bray N, Dubchak I, Pachter L: AVID: A global alignment program. Genome Res. 2003, 13: 97-102. 10.1101/gr.789803.PubMedPubMed CentralView ArticleGoogle Scholar
- Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004, 32: 1792-1797. 10.1093/nar/gkh340.PubMedPubMed CentralView ArticleGoogle Scholar
- Huang W, Umbach DM, Li L: Accurate anchoring alignment of divergent sequences. Bioinformatics. 2006, 22: 29-34. 10.1093/bioinformatics/bti772.PubMedView ArticleGoogle Scholar
- Notredame C, Higgins DG, Heringa J: T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000, 302: 205-217. 10.1006/jmbi.2000.4042.PubMedView ArticleGoogle Scholar
- McClure MA, Vasi TK, Fitch WM: Comparative analysis of multiple protein-sequence alignment methods. Mol Biol Evol. 1994, 11: 571-592.PubMedGoogle Scholar
- Sauder JM, Arthur JW, Dunbrack RL: Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins. 2000, 40: 6-22. 10.1002/(SICI)1097-0134(20000701)40:1<6::AID-PROT30>3.0.CO;2-7.PubMedView ArticleGoogle Scholar
- Thompson JD, Plewniak F, Poch O: A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 1999, 27: 2682-2690. 10.1093/nar/27.13.2682.PubMedPubMed CentralView ArticleGoogle Scholar
- Lassmann T, Sonnhammer EL: Quality assessment of multiple alignment programs. FEBS Lett. 2002, 529: 126-130. 10.1016/S0014-5793(02)03189-7.PubMedView ArticleGoogle Scholar
- Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, Green ED, Sidow A, Batzoglou S: LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 2003, 13: 721-731. 10.1101/gr.926603.PubMedPubMed CentralView ArticleGoogle Scholar
- Pollard DA, Bergman CM, Stoye J, Celniker SE, Eisen MB: Benchmarking tools for the alignment of functional noncoding DNA. BMC Bioinformatics. 2004, 5: 6-10.1186/1471-2105-5-6.PubMedPubMed CentralView ArticleGoogle Scholar
- Pollard DA, Moses AM, Iyer VN, Eisen MB: Detecting the limits of regulatory element conservation and divergence estimation using pairwise and multiple alignments. BMC Bioinformatics. 2006, 7: 376-10.1186/1471-2105-7-376.PubMedPubMed CentralView ArticleGoogle Scholar
- DeGregori J, Leone G, Miron A, Jakoi L, Nevins JR: Distinct roles for E2F proteins in cell growth control and apoptosis. Proc Natl Acad Sci USA. 1997, 94: 7245-7250. 10.1073/pnas.94.14.7245.PubMedPubMed CentralView ArticleGoogle Scholar
- Muller H, Bracken AP, Vernell R, Moroni MC, Christians F, Grassilli E, Prosperini E, Vigo E, Oliner JD, Helin K: E2Fs regulate the expression of genes involved in differentiation, development, proliferation, and apoptosis. Genes Dev. 2001, 15: 267-285. 10.1101/gad.864201.PubMedPubMed CentralView ArticleGoogle Scholar
- Bernard S, Eilers M: Control of cell proliferation and growth by Myc proteins. Results Probl Cell Differ. 2006, 42: 329-342.PubMedView ArticleGoogle Scholar
- Amati B, Littlewood TD, Evan GI, Land H: The c-Myc protein induces cell cycle progression and apoptosis through dimerization with Max. EMBO J. 1993, 12: 5083-5087.PubMedPubMed CentralGoogle Scholar
- La Thangue NB: Transcription. Chromatin control - a place for E2F and Myc to meet. Science. 2002, 296: 1034-1035. 10.1126/science.1072446.PubMedView ArticleGoogle Scholar
- Ogawa H, Ishiguro K, Gaubatz S, Livingston DM, Nakatani Y: A complex with chromatin modifiers that occupies E2F- and Myc-responsive genes in G0 cells. Science. 2002, 296: 1132-1136. 10.1126/science.1069861.PubMedView ArticleGoogle Scholar
- Ditsworth D, Zong WX: NF-kappaB: key mediator of inflammation-associated cancer. Cancer Biol Ther. 2004, 3: 1214-1216.PubMedView ArticleGoogle Scholar
- Dobrovolskaia MA, Kozlov SV: Inflammation and cancer: when NF-kappaB amalgamates the perilous partnership. Curr Cancer Drug Targets. 2005, 5: 325-344. 10.2174/1568009054629645.PubMedView ArticleGoogle Scholar
- Luo JL, Kamata H, Karin M: IKK/NF-kappaB signaling: balancing life and death - a new approach to cancer therapy. J Clin Invest. 2005, 115: 2625-2632. 10.1172/JCI26322.PubMedPubMed CentralView ArticleGoogle Scholar
- Vlieghe D, Sandelin A, De Bleser PJ, Vleminckx K, Wasserman WW, van Roy F, Lenhard B: A new generation of JASPAR, the open-access repository for transcription factor binding site profiles. Nucleic Acids Res. 2006, 34: D95-97. 10.1093/nar/gkj115.PubMedPubMed CentralView ArticleGoogle Scholar
- Hasegawa M, Kishino H, Yano T: Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol. 1985, 22: 160-174. 10.1007/BF02101694.PubMedView ArticleGoogle Scholar
- Bieda M, Xu X, Singer MA, Green R, Farnham PJ: Unbiased location analysis of E2F1-binding sites suggests a widespread role for E2F1 in the human genome. Genome Res. 2006, 16: 595-605. 10.1101/gr.4887606.PubMedPubMed CentralView ArticleGoogle Scholar
- Ren B, Cam H, Takahashi Y, Volkert T, Terragni J, Young RA, Dynlacht BD: E2F integrates cell cycle progression with DNA repair, replication, and G(2)/M checkpoints. Genes Dev. 2002, 16: 245-256. 10.1101/gad.949802.PubMedPubMed CentralView ArticleGoogle Scholar
- Yan Z, DeGregori J, Shohet R, Leone G, Stillman B, Nevins JR, Williams RS: Cdc6 is regulated by E2F and is essential for DNA replication in mammalian cells. Proc Natl Acad Sci USA. 1998, 95: 3603-3608. 10.1073/pnas.95.7.3603.PubMedPubMed CentralView ArticleGoogle Scholar
- Hateboer G, Wobst A, Petersen BO, Le Cam L, Vigo E, Sardet C, Helin K: Cell cycle-regulated expression of mammalian CDC6 is dependent on E2F. Mol Cell Biol. 1998, 18: 6679-6697.PubMedPubMed CentralView ArticleGoogle Scholar
- Ohtani K, Tsujimoto A, Ikeda M, Nakamura M: Regulation of cell growth-dependent expression of mammalian CDC6 gene by the cell cycle transcription factor E2F. Oncogene. 1998, 17: 1777-1785. 10.1038/sj.onc.1202105.PubMedView ArticleGoogle Scholar
- Siepel A, Haussler D: Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol Biol Evol. 2004, 21: 468-488. 10.1093/molbev/msh039.PubMedView ArticleGoogle Scholar
- Ogurtsov AY, Sunyaev S, Kondrashov AS: Indel-based evolutionary distance and mouse-human divergence. Genome Res. 2004, 14: 1610-1616. 10.1101/gr.2450504.PubMedPubMed CentralView ArticleGoogle Scholar
- Schlisio S, Halperin T, Vidal M, Nevins JR: Interaction of YY1 with E2Fs, mediated by RYBP, provides a mechanism for specificity of E2F function. EMBO J. 2002, 21: 5775-5786. 10.1093/emboj/cdf577.PubMedPubMed CentralView ArticleGoogle Scholar
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22: 4673-4680. 10.1093/nar/22.22.4673.PubMedPubMed CentralView ArticleGoogle Scholar
- Morgenstern B: DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics. 1999, 15: 211-218. 10.1093/bioinformatics/15.3.211.PubMedView ArticleGoogle Scholar
- Bray N, Pachter L: MAVID multiple alignment server. Nucleic Acids Res. 2003, 31: 3525-3526. 10.1093/nar/gkg623.PubMedPubMed CentralView ArticleGoogle Scholar
- Huang W, Umbach DM, Ohler U, Li L: Optimized mixed Markov models for motif identification. BMC Bioinformatics. 2006, 7: 279-10.1186/1471-2105-7-279.PubMedPubMed CentralView ArticleGoogle Scholar
- Yeo G, Burge CB: Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J Comput Biol. 2004, 11: 377-394. 10.1089/1066527041410418.PubMedView ArticleGoogle Scholar
- Arnosti DN, Kulkarni MM: Transcriptional enhancers: Intelligent enhanceosomes or flexible billboards?. J Cell Biochem. 2005, 94: 890-898. 10.1002/jcb.20352.PubMedView ArticleGoogle Scholar
- Mustonen V, Lassig M: Evolutionary population genetics of promoters: predicting binding sites and functional phylogenies. Proc Natl Acad Sci USA. 2005, 102: 15936-15941. 10.1073/pnas.0505537102.PubMedPubMed CentralView ArticleGoogle Scholar
- Margulies EH, Chen CW, Green ED: Differences between pair-wise and multi-sequence alignment methods affect vertebrate genome comparisons. Trends Genet. 2006, 22: 187-193. 10.1016/j.tig.2006.02.005.PubMedView ArticleGoogle Scholar
- Morgenstern B, Dress A, Werner T: Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc Natl Acad Sci USA. 1996, 93: 12098-12103. 10.1073/pnas.93.22.12098.PubMedPubMed CentralView ArticleGoogle Scholar
- Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, et al: Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004, 14: 708-715. 10.1101/gr.1933104.PubMedPubMed CentralView ArticleGoogle Scholar
- Loytynoja A, Goldman N: An algorithm for progressive multiple alignment of sequences with insertions. Proc Natl Acad Sci USA. 2005, 102: 10557-10562. 10.1073/pnas.0409137102.PubMedPubMed CentralView ArticleGoogle Scholar
- Do CB, Mahabhashyam MS, Brudno M, Batzoglou S: ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res. 2005, 15: 330-340. 10.1101/gr.2821705.PubMedPubMed CentralView ArticleGoogle Scholar
- Weichun Huang's Research Domain. [http://biomedempire.org]
- Uwe Ohler's Gene Regulation Webpage. [http://tools.genome.duke.edu/generegulation]
- Blanco E, Messeguer X, Smith TF, Guigo R: Transcription factor map alignment of promoter regions. PLoS Comput Biol. 2006, 2: e49-10.1371/journal.pcbi.0020049.PubMedPubMed CentralView ArticleGoogle Scholar
- Quandt K, Frech K, Karas H, Wingender E, Werner T: MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Res. 1995, 23: 4878-4884. 10.1093/nar/23.23.4878.PubMedPubMed CentralView ArticleGoogle Scholar
- Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B: JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res. 2004, 32: D91-94. 10.1093/nar/gkh012.PubMedPubMed CentralView ArticleGoogle Scholar
- Jukes TH, Cantor CR: Evolution of protein molecules. Mammalian Protein Metabolism. Edited by: Munro MN. 1969, New York: Academic Press, 3: 21-132.View ArticleGoogle Scholar
- Felsenstein J: Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 1981, 17: 368-376. 10.1007/BF01734359.PubMedView ArticleGoogle Scholar
- Kimura M: A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol. 1980, 16: 111-120. 10.1007/BF01731581.PubMedView ArticleGoogle Scholar
- Tamura K, Nei M: Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol. 1993, 10: 512-526.PubMedGoogle Scholar
- Lanave C, Preparata G, Saccone C, Serio G: A new method for calculating evolutionary substitution rates. J Mol Evol. 1984, 20: 86-93. 10.1007/BF02101990.PubMedView ArticleGoogle Scholar
- Tavare S: Some probabilistic and statisical problems on the analysis of DNA sequences. Lect Math Life Sci. 1986, 17: 57-86.Google Scholar
- Rodriguez F, Oliver JL, Marin A, Medina JR: The general stochastic model of nucleotide substitution. J Theor Biol. 1990, 142: 485-501.PubMedView ArticleGoogle Scholar
- Yang Z: Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J Mol Evol. 1994, 39: 306-314. 10.1007/BF00160154.PubMedView ArticleGoogle Scholar
- Waddell PJ, Steel MA: General time-reversible distances with unequal rates across sites: mixing gamma and inverse Gaussian distributions with invariant sites. Mol Phylogenet Evol. 1997, 8: 398-414. 10.1006/mpev.1997.0452.PubMedView ArticleGoogle Scholar
- Cartwright RA: DNA assembly with gaps (Dawg): simulating sequence evolution. Bioinformatics. 2005, 21 (Suppl 3): iii31-iii38. 10.1093/bioinformatics/bti1200.PubMedView ArticleGoogle Scholar
- Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W: Human-mouse alignments with BLASTZ. Genome Res. 2003, 13: 103-107. 10.1101/gr.809403.PubMedPubMed CentralView ArticleGoogle Scholar
- Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL: Versatile and open software for comparing large genomes. Genome Biol. 2004, 5: R12-10.1186/gb-2004-5-2-r12.PubMedPubMed CentralView ArticleGoogle Scholar
- Rambaldi D, Guffanti A, Morandi P, Cassata G: NemaFootPrinter: a web based software for the identification of conserved non-coding genome sequence regions between C. elegans and C. briggsae. BMC Bioinformatics. 2005, 6 (Suppl 4): S22-10.1186/1471-2105-6-S4-S22.PubMedPubMed CentralView ArticleGoogle Scholar
- Lee C, Grasso C, Sharlow MF: Multiple sequence alignment using partial order graphs. Bioinformatics. 2002, 18: 452-464. 10.1093/bioinformatics/18.3.452.PubMedView ArticleGoogle Scholar
- Katoh K, Kuma K, Toh H, Miyata T: MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 2005, 33: 511-518. 10.1093/nar/gki198.PubMedPubMed CentralView ArticleGoogle Scholar
- Edgar RC: MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics. 2004, 5: 113-10.1186/1471-2105-5-113.PubMedPubMed CentralView ArticleGoogle Scholar
- Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: a sequence logo generator. Genome Res. 2004, 14: 1188-1190. 10.1101/gr.849004.PubMedPubMed CentralView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.