Computational prediction of transcription-factor binding site locations
© BioMed Central Ltd 2003
Published: 23 December 2003
Identifying genomic locations of transcription-factor binding sites, particularly in higher eukaryotic genomes, has been an enormous challenge. Various experimental and computational approaches have been used to detect these sites; methods involving computational comparisons of related genomes have been particularly successful.
The publication of a nearly complete draft sequence of the human genome is an enormous achievement, but characterizing the entire set of functional elements encoded in the human and other genomes remains an immense challenge . Francis Collins, Director of the National Human Genome Research Institute (USA), has proposed that "the next phase of genomics is to catalog, characterize and comprehend the entire set of functional elements [including those that do not encode protein] encoded in the human and other genomes" . Two of the most important functional elements in any genome are transcription factors (TFs) and the sites within the DNA to which they bind. These interactions between protein and DNA control many important processes, such as critical steps in development and responses to environmental stresses, and defects in them can contribute to the progression of various diseases. Much progress has been made recently in the accumulation and analysis of mRNA transcript profiles of a variety of cell and tissue types, including those associated with various human diseases ; much remains to be understood, however, about the transcriptional regulatory networks that govern these expression profiles. A more complete understanding of transcription factors, their DNA binding sites, and their interactions, will permit a more comprehensive and quantitative mapping of the regulatory pathways within cells, as well as a deeper understanding of the potential functions of individual genes regulated by newly identified DNA-binding sites.
Much of the information on TF binding specificity has been determined using traditional methodologies such as footprinting methods that identify the region of DNA protected by a bound protein, nitrocellulose binding assays, gel-shift analysis that monitors the change in mobility when DNA and protein bind, Southwestern blotting of both DNA and protein, or reporter constructs. These methods are generally quite time-consuming and not readily scaled up to whole genomes, however. In recent years, therefore, a number of high-throughput technologies have been developed for identifying TF binding sites (TFBSs) both in vitro and in vivo. One high-throughput method for finding high-affinity binding sequences in vitro is the selection, from a pool of randomized double-stranded DNAs, of those that bind with high affinity to a protein of interest; this approach is frequently referred to as SELEX (systematic evolution of ligands by exponential enrichment). This method has been further modified into genomic SELEX, which uses a genomic library as the starting material for the selections. More recently, the sequence specificities of DNA-binding proteins have been determined by direct binding of proteins to double-stranded DNA microarrays [7, 8].
Similarly, high-throughput methods have also been developed for measuring the interactions between DNA and TFs in vivo. Microarray-based readout of chromatin immunoprecipitation assays ('ChIP-chip'), also referred to as genome-wide location analysis , is currently the most widely used method for identifying genomic TFBSs in vivo and in a high-throughput manner (see  for a review). This approach has been used to characterize a number of TFs in the yeast Saccharomyces cerevisiae [9, 11–15] and, more recently, to identify genomic targets in mammalian cells [16–18]. Another recently developed method that takes advantage of DNA microarrays for the identification of TFBSs in vivo uses TFs tethered to DNA adenine methyltransferase (Dam) [19, 20], resulting in DNA methylation near sites bound by the TF-Dam fusion protein [19, 20]. This approach has been used to identify binding sites in vivo in Drosophila [20, 21] and Arabidopsis .
Once a regulatory sequence motif has been identified, the next goal is frequently to identify candidate target genes that may be regulated through it, potentially by a TF that may bind to it. Although degenerate consensus sequences (Figure 1a) are still frequently used to depict the binding specificities of TFs, they do not contain precise information about the relative likelihood of observing the alternate nucleotides at the various positions of a TFBS. Thus, a common way of representing the degenerate sequence preferences of a DNA-binding protein is by a position weight matrix (PWM), also known as a position-specific scoring matrix (PSSM) (see  for review). Briefly, the elements of a PWM correspond to scores reflecting the likelihood of observing that particular nucleotide at that particular position of the known or candidate TFBS (Figure 1b). Although there are certain problems inherent in the use of PWMs, they are nevertheless a good approximation and a useful representation that can identify biologically interesting candidate sites [23–26]. Furthermore, even though the binding of a TF in vitro can be predicted accurately from a large set of experimentally defined binding sites, such predicted sites may not serve a direct regulatory function, or even be bound, in vivo. Stormo and Fields  have said that "this is not a failure of the computational techniques, but rather reflects biological reality: competition, chromatin structure and other influences are as important as binding affinity".
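As an illustration of the PWM representation described above, the following is a minimal sketch in Python. It assumes a small set of aligned, equal-length example sites, a uniform background frequency of 0.25 per nucleotide, and a pseudocount of 1; the function names (`build_pwm`, `score_site`, `scan`) and the example sites are hypothetical and are not taken from any of the tools cited in this article.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM (one dict per position) from aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for pos in range(length):
        # Start from the pseudocount so unseen bases do not score -infinity.
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({base: math.log2((counts[base] / total) / background)
                    for base in BASES})
    return pwm

def score_site(pwm, seq):
    """Sum the per-position scores for a candidate site of PWM width."""
    return sum(column[base] for column, base in zip(pwm, seq))

def scan(pwm, sequence):
    """Score every window of PWM width along one strand of a sequence."""
    width = len(pwm)
    return [(start, score_site(pwm, sequence[start:start + width]))
            for start in range(len(sequence) - width + 1)]

# Hypothetical aligned binding sites used only for illustration.
example_sites = ["TGACTCA", "TGACTCA", "TGAGTCA", "TGACTAA"]
example_pwm = build_pwm(example_sites)
best_start, best_score = max(scan(example_pwm, "CCCTGACTCACCC"),
                             key=lambda hit: hit[1])
```

The sketch scans only one strand and treats positions as independent, which is the same simplifying assumption that underlies PWMs in general; a practical scan would also score the reverse complement.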
A number of collections of experimentally defined TFBSs have been assembled. The largest and most commonly used collection is the TRANSFAC database , which catalogs eukaryotic TFs and their known binding sites, and provides PWMs. Likewise, a number of tools, such as MatInd and MatInspector , MATRIX SEARCH , SIGNAL SCAN , and rVISTA , have been developed to allow the user to search an input sequence, such as a genome of interest, for matches to a PWM or a library of PWMs. In addition to motif-match searching, genes can also be classified according to whether they are likely to be regulated through a particular motif or combination of motifs, such as by using Hidden Markov Models  to statistically model the number and relative locations of TFBSs within a sequence .
The prediction and experimental identification of regulatory regions in higher eukaryotes is more difficult than in model organisms with smaller genomes, partly because of the larger genome size, because a larger portion of higher genomes is noncoding, and because even the general principles governing the locations of DNA regulatory elements in higher eukaryotic genomes remain unknown. For example, regulatory elements can be found far upstream of coding regions, within introns, and even far downstream of the genes they regulate, making the search for them difficult. Given this large sequence space in which to search, methods of enrichment are necessary for an efficient search.
One method to enrich for shared sets of candidate regulatory elements is to focus on the noncoding sequence surrounding genes that have very similar mRNA expression patterns. A number of studies have been successful in extracting sequence motifs from expression data or groups of functionally related genes in yeast [35–39]. Extracting candidate regulatory motifs in this manner from a single genome's sequence becomes much more difficult in higher eukaryotes, however, because of the much greater amount of input sequence that must go into the motif search algorithms. This increased amount of input sequence increases the background noise levels in the motif search, making it more difficult to extract the true regulatory motifs. For these reasons, it has been suggested that comparisons between genomes be incorporated into the searches of higher eukaryotic expression clusters for regulatory motifs, as an additional method for further enriching for likely regulatory elements .
A major method for enriching for candidate regulatory elements is to identify regions of sequence conservation between genomes, as it is these conserved regions that are likely to contain important regulatory sites. This method of performing phylogenetic comparisons to reveal conserved cis elements in the noncoding regions of homologous genes is referred to as 'phylogenetic footprinting' . It has been described as searching for "islands of conserved sequences in seas of less conserved noncoding sequence" .
An important first step in phylogenetic footprinting is to identify orthologs, genes in different species that are derived from the same gene in the last common ancestral species and thus usually have similar functions in the genomes being compared. In contrast, paralogs are duplicate gene pairs within a genome that have diverged and typically have different functions. Orthologs need to be distinguished from paralogs, because it can be expected that as the functions of paralogs have diverged, their transcriptional regulators may also have diverged. At relatively close evolutionary distances (divergence around 40-80 million years ago (Mya)) it can be difficult to distinguish between undiscovered coding sequences and functional noncoding sequences, so comparison with distantly related species can improve the ability to distinguish these classes of conserved sequences. Frazer and colleagues [42, 43] have reviewed methods for cross-species sequence comparisons.
With the development of improved sequencing technologies, the cost of sequencing has dropped significantly, making genome-scale comparative sequence analysis projects possible. In the initial sequencing and comparative analysis of the mouse genome, Waterston and colleagues  found that at the nucleotide level approximately 40% of the human genome can be aligned to the mouse genome (which diverged around 75 Mya), and that about 80% of mouse genes have a single identifiable ortholog in the human genome. By examining the extent of genome-wide sequence conservation, they determined that a much higher fraction of short segments in the mammalian genome are under selection than can be explained by protein-coding sequences alone .
In a comparison by Loots and colleagues  of 1 megabase (Mb) of orthologous human and mouse sequences surrounding the interleukin genes IL-4, IL-13, and IL-5, 90 conserved noncoding elements with at least 70% identity over at least 100 bp were discovered. Analysis of a subset of these elements indicated that many were highly conserved in at least two mammals in addition to humans and mice. Many of the conserved noncoding sequences were found in clusters, suggesting that they may work cooperatively. Subsequent in vivo characterization of the largest element ('CNS-1') in mice revealed it to be a coordinate regulator of IL-4, IL-13, and IL-5 . Although no experimental verification is available on the remaining 89 conserved noncoding sequences, these findings give hope that similar genomic comparisons will be fruitful. A similar set of studies on human-mouse pairwise sequence comparisons surrounding the stem-cell leukemia locus (SCL) identified known and predicted SCL enhancers [46–48].
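The conservation criterion quoted above (at least 70% identity over at least 100 bp) can be sketched as a sliding-window scan over a gapped pairwise alignment. This is only an illustrative sketch, not the procedure used in the cited study: the function name `conserved_windows` is hypothetical, gap columns are counted as mismatches, and overlapping qualifying windows are merged into single regions reported in alignment coordinates.

```python
def conserved_windows(aln_a, aln_b, window=100, min_identity=0.70):
    """Return merged (start, end) alignment intervals, end exclusive, in which
    a sliding window of the given size reaches the identity threshold."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    n = len(aln_a)
    if n < window:
        return []
    # 1 where the two aligned residues are identical and not a gap, else 0.
    match = [1 if a == b and a != "-" else 0
             for a, b in zip(aln_a.upper(), aln_b.upper())]
    regions = []
    current = sum(match[:window])
    for start in range(n - window + 1):
        if start > 0:
            # Slide the window one column to the right.
            current += match[start + window - 1] - match[start - 1]
        if current / window >= min_identity:
            end = start + window
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)  # extend previous region
            else:
                regions.append((start, end))
    return regions
```

A fixed window and a fixed identity cutoff are the simplest possible choices; because conservation levels vary between loci, a single cutoff applied genome-wide can over-call some regions and under-call others.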
The pufferfish Fugu rubripes has been considered as a particularly useful species for cross-species genome sequence comparisons  because, unlike mammals, it has a compact genome . For similar reasons, the human genome has also been compared with the chicken genome (which diverged about 300 Mya); about 30-50% of genes in the chicken genome are concentrated in minichromosomes with gene density approaching that of the pufferfish . It is important to remember, however, that the species that are compared will determine what kinds of functional elements can be found (primate-specific, mammal-specific, and so on). For example, only 16% of orthologous genes between mammals and bony fishes (which diverged about 450 Mya) contain conserved elements in their noncoding regions, so mammal-specific elements are unlikely to be found through fish-human comparisons . These findings question both the utility of sequence comparisons beyond mammals in thoroughly identifying gene regulatory elements and the correct criteria for identifying conserved noncoding sequences.
In prokaryotes and yeast, motif-finding studies generally need to search only a few hundred base-pairs upstream of predicted translational start sites [36, 37, 52]. In higher eukaryotic genomes, however, transcriptional start sites can be kilobases away from the translational start sites , so identification of the start site is an important task in order that searches of upstream sequence can be focused on noncoding sequence upstream of 5' untranslated regions (UTRs; for reviews see [51, 54]).
The next important algorithmic decision is whether to perform local or global sequence alignments in order to identify regions of sequence homology. Whereas local alignments are computed to produce optimal similarity between subregions of the sequence, global alignments are computed to produce optimal similarity over the entire length of the two sequences being compared. Various alignment algorithms have been developed that permit pairwise or multiple alignments of sequences. The program rVISTA performs global alignment of genomic sequences and then searches within the conserved regions for conserved TFBSs matching known PWMs. One limitation of this approach is that certain TFBSs may be located in regions not conserved at sufficiently high levels to be identified as conserved by rVISTA parameters. Likewise, the choice of which alignment method to use, and thus the resulting genomic sequence alignments, can also have profound effects on which potential cis-regulatory elements are found. Of note is a pairwise comparison of D. melanogaster and D. virilis (which diverged about 40 Mya), in which it was found that the majority of discordant blocks are missed uniquely by only one of the three alignment methods used. Thus, the use of more than one alignment method may be beneficial for the most complete identification of candidate cis-regulatory elements.
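The difference between global and local alignment can be made concrete with a minimal dynamic-programming sketch in Python. It computes only the optimal score (no traceback) under a simple linear gap penalty; the scoring values (match +1, mismatch -1, gap -2) and the function name `alignment_score` are assumptions chosen for illustration and do not correspond to the parameters of any program cited here.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Optimal pairwise alignment score with a linear gap penalty.

    local=False: global (Needleman-Wunsch style), end to end.
    local=True:  local (Smith-Waterman style), best-scoring subregions.
    """
    # First row: leading gaps cost nothing in local mode.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)   # and may end anywhere
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]
```

For two sequences that share only a short block, such as `"CCCCGATTACA"` and `"GATTACATTTT"`, the local score reflects the shared block alone, whereas the global score is pulled down by the unrelated flanks that must also be aligned.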
In addition to considerations regarding which genomes to compare and how to align them, there is the additional issue that the level of sequence conservation varies widely across genomes. In a comparison of orthologous human and mouse sequence, Koop and colleagues [57, 58] found variable levels of sequence similarity, with high levels of similarity in the T-cell receptor locus and the α and β myosin genes, and very low levels in the γ-crystallin, XRCC1, and β-globin gene clusters. These and other findings [43, 57–60] suggest that different regions of the genome evolve at different rates. Thus, using fixed percentage identity cutoffs across entire genomes for considering regions conserved is likely to result in too much sequence being identified as functionally conserved in some regions and too little functionally conserved sequence being identified in other regions . Reviews are available on strategies and resources for finding regulatory elements in mammalian genomes [40, 42, 62], the theory behind various alignment algorithms , and algorithms for phylogenetic footprinting, including the development of an algorithm that makes use of the phylogenetic tree underlying the data . In addition, the annual Nucleic Acids Research Web Server Issue  includes tools for analysis of gene-expression data, prediction of cis-regulatory modules, sequence alignments, promoter prediction, and discovery and identification of candidate TFBSs, and the annual Nucleic Acids Research Database Issue  includes nucleotide sequence databases, comparative genomics databases, gene-expression databases, and various protein databases.
TFs associated with expression specific to skeletal muscle have been studied extensively, probably as a result of good cell-culture models for differentiation. Wasserman and Fickett  have created a TFBS database derived from a literature search for experimentally defined TFBSs for five TFs associated with skeletal-muscle-specific expression: Mef-2, Myf, Sp1, SRF, and Tef. In searching the Eukaryotic Promoter Database (EPD) , they found that high-scoring sites occurred more frequently in sequences linked to muscle-specific expression . In a comparison of 28 orthologous human-mouse gene pairs that are specifically upregulated in skeletal muscle, Wasserman's group  found that 98% of experimentally defined sequence-specific binding sites of TFs specific to skeletal muscle are confined to the 19% of human noncoding sequences that are most conserved in the orthologous rodent sequences.
In higher eukaryotes, TFs frequently bind DNA within segments of sequence, typically hundreds of base-pairs long, termed cis-regulatory modules or enhancers. A given gene can have multiple such modules in its surrounding noncoding sequence; they typically direct expression in either a cell-type-specific or temporal-specific manner . Typically four to eight different TFs bind within an enhancer, and each factor can bind to multiple sites within it [53, 70] (for reviews on transcriptional regulation in metazoans, see [69, 70]). Because pairs of sites may correspond to TFs that coregulate expression of the nearby gene(s) , a number of approaches have been developed to identify pairs of binding sites [72–78]. For example, one study focusing on the MEF2 and MyoD families of TF found that where the two bind in the same regulatory region, their binding sites occur at precise distances relative to the helical turn of DNA, and thus probably allow cooperative protein-protein interactions . Although some TFs may require specific distances between their binding sites for cooperative binding, it has been thought that in many cases the exact spacing and order of TFBSs is not important for enhancer function .
More recently, approaches have been developed to identify higher-order site clusterings [81–93]. Such clusters can be homotypic, containing multiple sites for one particular TF, or heterotypic, containing one or more binding sites for multiple TFs . A search of vertebrate genomic sequence revealed that sites bound by the liver regulatory TF hepatocyte nuclear factor 1 (HNF1) occurred more frequently in hepatic genes than expected by chance, that HNF1-binding sites in liver genes are more often associated in clusters with sites for other TFs than expected by chance, and that the enrichment is more pronounced in promoter regions . In a search for matches to TRANSFAC PWMs within conserved noncoding sequences surrounding a set of human and mouse genes, conserved segments in upstream regions contained TFBS pairs colocalized in a manner consistent with experimentally known pairwise co-occurrences of TFs .
In a recently published study, Wasserman and colleagues  performed human-mouse sequence comparisons of 14 well-studied genes and searched for matches to TFBS PWMs within the conserved noncoding regions, using a range of PWM score thresholds. The choice of PWM score cutoffs is a critical issue in all predictions of sites from PWMs, as the requirement for a more stringent match (a higher cutoff) is likely to result in fewer false-positive predictions but can potentially result in more sites being missed (false negatives). The same kind of problem occurs when conserved regions are used: the assumption is that fewer of the motif 'hits' will be false positives than when searching the whole genome, but a greater number of functional sites may be missed because they occur outside conserved regions. Considering regions with 70% sequence identity and a 75% relative matrix score threshold, Wasserman and colleagues found that 66% of previously verified TFBSs were detected with phylogenetic footprinting, compared with 73% when just single sequences were scanned. At a 60% matrix score threshold, looking just within the conserved regions, they were able to detect 83% of TFBSs  (although one has to keep in mind that decreasing the PWM score threshold will increase the number of likely false-positive hits).
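The trade-off between the score threshold and the restriction to conserved regions can be sketched as follows. The sketch assumes a relative matrix score defined as (score - minimum possible) / (maximum possible - minimum possible), which is one common convention and may differ from the definition used in the study discussed above; the toy PWM, the sequence, and the function names are hypothetical.

```python
def relative_score(pwm, site):
    """Rescale a raw PWM score to the range 0..1 (assumed convention)."""
    raw = sum(column[base] for column, base in zip(pwm, site))
    lowest = sum(min(column.values()) for column in pwm)
    highest = sum(max(column.values()) for column in pwm)
    return (raw - lowest) / (highest - lowest)

def predict_sites(pwm, sequence, threshold, regions=None):
    """Return start positions whose relative score meets the threshold.

    If regions is given (list of (start, end) with end exclusive), only
    windows lying entirely inside one of those regions are reported.
    """
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        if regions is not None and not any(
                lo <= start and start + width <= hi for lo, hi in regions):
            continue
        if relative_score(pwm, sequence[start:start + width]) >= threshold:
            hits.append(start)
    return hits

# Toy PWM whose best-scoring site is ACGT (illustrative values only).
TOY_PWM = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 2, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 2, "T": -1},
    {"A": -1, "C": -1, "G": -1, "T": 2},
]
SEQUENCE = "TTACGTTTACGATT"

strict = predict_sites(TOY_PWM, SEQUENCE, threshold=0.9)
relaxed = predict_sites(TOY_PWM, SEQUENCE, threshold=0.75)
relaxed_conserved = predict_sites(TOY_PWM, SEQUENCE, threshold=0.75,
                                  regions=[(0, 7)])
```

In this toy case, lowering the threshold adds a second, weaker match, and restricting the scan to the assumed conserved interval removes it again; whether such a removed hit is a false positive or a missed functional site cannot be decided from the sequence alone.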
The yeasts are good organisms for phylogenetic footprinting because the complete S. cerevisiae sequence has been available for quite some time now, Saccharomyces genomes are relatively small and have relatively compact noncoding sequences (about 30% of the genome is noncoding), their phylogeny is well-characterized (with many related species at various evolutionary distances), and because of the ease of experimental validation in yeast. Yeast strains closely related to S. cerevisiae can be divided into three sub-groups: Saccharomyces sensu stricto, Saccharomyces sensu lato and petite-negative (these last two sub-groups have fewer chromosomes and are significantly different physiologically from S. cerevisiae). In a key paper, Johnston and colleagues  described their survey of a number of orthologous genomic loci in seven yeast strains from these sub-groups, in order to evaluate which genomes would be most useful for identifying conserved TFBSs in promoter regions. As an example, for Gal4 and Mig1 TFBSs, they saw conservation not just of TFBS sequences, but also of spacing, in sensu stricto species, but this conservation was not seen in sensu lato species. Looking forward, the authors identified the problem of balancing the need to align orthologous sequences with the aim of having the functional elements stand out .
Subsequently, the same group sequenced the genomes of three sensu stricto strains (S. mikatae, S. kudriavzevii, and S. bayanus) and two more distantly related strains (S. castellii and S. kluyveri), and performed both four-way genome sequence alignments over just the sensu stricto strains and also six-way alignments over all the sequenced strains, including S. cerevisiae. They restricted their search of the multi-species genome sequence alignments to sequences of length 6-30 bp with no gaps (that is, there is no nucleotide within the site for which there is no sequence preference), and required motifs to be 100% conserved across all species under consideration and found in the upstream regions of at least five genes. They chose to focus on ungapped sequences because of their observation that most characterized sequence motifs do not have gaps. In addition to identifying most characterized ungapped motifs that met their stringent criteria, Johnston's group also identified 79 unique unknown conserved elements of length 6-30 bp with no gaps, with some evidence for functionality, as characterized by correlation with functional category enrichment using Munich Information Center for Protein Sequences (MIPS) annotation, mRNA expression coherence, or correlation with ChIP-chip data.
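The filtering criteria described above (ungapped, identical in every aligned species, and present upstream of at least five genes) can be sketched in Python. This is a simplified illustration rather than the authors' pipeline: it uses a single fixed motif length `k` instead of the full 6-30 bp range, assumes the multi-species alignments are already available as equal-length gapped strings per gene, and uses hypothetical function and gene names.

```python
from collections import defaultdict

def conserved_kmers(alignment, k):
    """Return the set of ungapped k-mers that are identical in every row
    (species) of one multiple alignment of an upstream region."""
    length = len(alignment[0])
    # A column qualifies if all species share the same non-gap nucleotide.
    identical = [len({row[i] for row in alignment}) == 1
                 and alignment[0][i] != "-"
                 for i in range(length)]
    found = set()
    run = 0  # length of the current run of qualifying columns
    for i, ok in enumerate(identical):
        run = run + 1 if ok else 0
        if run >= k:
            found.add(alignment[0][i - k + 1:i + 1])
    return found

def shared_conserved_motifs(alignments_by_gene, k=6, min_genes=5):
    """Keep fully conserved k-mers found upstream of at least min_genes genes."""
    genes_with = defaultdict(set)
    for gene, alignment in alignments_by_gene.items():
        for kmer in conserved_kmers(alignment, k):
            genes_with[kmer].add(gene)
    return {kmer: sorted(genes) for kmer, genes in genes_with.items()
            if len(genes) >= min_genes}
```

A fuller treatment would also consider the reverse complement, extend conserved k-mers to longer elements, and assess how often a k-mer would be conserved by chance given the overall divergence of the aligned species.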
In a similar study, Lander and colleagues included an elegant analysis focused on identifying known and novel candidate regulatory motifs. They limited themselves to comparing four sensu stricto species: S. cerevisiae, S. paradoxus, S. mikatae, and S. bayanus; there was an overlap of three species with the six species examined by the Johnston group. The primary assumption in choosing these species was that they should represent as narrow a taxon as possible (in contrast to the approach of Johnston's group), as identified motifs must be common to all species. To put these comparisons into perspective, the sequence divergence between S. cerevisiae and the most distant of these four species, S. bayanus, is similar to that between human and mouse, although there is an inherent difference in signal-to-noise ratios in the genomes because of the differences in gene density (yeast genomes are about 70% coding whereas the human genome is about 2% coding) and the ratios of presumably non-regulatory noncoding sequence (whereas in yeast about 15% of intergenic regions are regulatory elements, in human only about 3% of noncoding regions are regulatory elements).
Using these various enrichment scores as filters, the authors  identified 72 full motifs, 42 of which did not match previously described regulatory DNA motifs in yeast. Most of the motifs were found preferentially upstream of genes, although some did show enrichment downstream of genes. This is an interesting observation to keep in mind, given that many studies that aim to find regulatory DNA elements in yeast have searched only upstream of the target gene(s). Furthermore, the focus for finding regulatory elements is currently on noncoding sequences. There is a general lack of data on the function of TFBSs within coding regions, although one recent ChIP-chip study on the yeast TF Rap1 found that binding sites within coding regions were much less likely to be bound in vivo . As this study  was performed on just one TF, however, it is unclear how general the observation will be.
Nevertheless, even in these high-resolution genome sequence comparisons, not all known motifs were found by either genome-wide or category-based analysis. Interestingly, some motifs appeared to define previously unknown binding sites associated with known TFs. Some motifs did not match regions bound by known TFs but showed strong functional category correlation; these motifs are potential binding sites for thus-far undiscovered TFs and are reasonable candidates for directed experiments to identify what TFs may bind them .
Similar phylogenetic footprinting approaches have been taken to try to identify regulatory elements in the noncoding portions of other genomes. A comparison of the Escherichia coli and Haemophilus influenzae genomes led to the identification of a novel motif that had not been found previously in either of the individual genomes, and to the discovery of new members of known regulons. In a search within alignments of a set of orthologous intergenic regions from the Caenorhabditis elegans and Caenorhabditis briggsae genomes (which diverged 23-40 Mya), an uneven distribution of short conserved sequence blocks was found across the genomes, again suggesting the potential co-occurrence of TFBSs within transcriptional enhancers. In an analysis of conservation over four Drosophila species spanning a range of divergence times, it was also found that conserved noncoding sequences tend to cluster spatially, with conserved spacing between them, and that there is a strong tendency for known cis-regulatory elements to overlap clusters of conserved noncoding sequences. Such clusters may correspond to functional interactions among transcriptional enhancers.
In a landmark paper examining enhancer function in Drosophila, Ludwig and co-workers  found that in a comparison of 13 species, none of 16 surveyed D. melanogaster TFBSs was completely conserved. They also observed differences in the spacing between TFBSs. Despite these differences between species, each enhancer drove reporter-gene expression at identical times and locations in the early D. melanogaster embryo. Chimeric enhancers did not recapitulate the wild-type expression pattern, however. The authors proposed that stabilizing selection has maintained phenotypic constancy, but has allowed mutation within the enhancer, and that substitutions within TFBSs and changes in the lengths of spacer regions between TFBSs would result in weak changes, with many functionally compensatory mutations. One of their significant conclusions was that this "may make it difficult to identify homologous elements in different species groups by sequence comparison alone" . This is an important observation to keep in mind in the development and application of algorithms for discovery in silico of transcriptional enhancers and TFBSs conserved across genomes, because conserved TFBSs may not necessarily occur within longer stretches of conserved sequence.
In an important recent study, Boffelli and colleagues  sequenced four different regions from over a dozen primate species, including Old World and New World monkeys and hominoids. The premise of their approach was that the human-mouse comparisons can fail to align meaningfully, and thus can fail to identify functional elements, and that the additive collective divergence of higher primates as a group is comparable to that of humans and mice . An additional consideration is that in comparing just human and mouse sequences there is the potential problem that some regions of the genome are highly conserved . In this 'phylogenetic shadowing' approach, they took into account the phylogenetic relationships of the analyzed species. The authors noted that the most informative subset of four to seven species can capture most of the discriminative power of the approach using the full set of species. Using gel-shift assays and luciferase reporter assays, they found that conserved regions were bound by protein more frequently, and thus were presumably more likely to be functional, than nonconserved regions .
In a similar study, Thomas and colleagues  compared sequences from 12 evolutionarily diverse vertebrate species, for sequences orthologous to a human chromosomal region containing 10 genes, including the gene mutated in cystic fibrosis (CFTR). The authors noted that the 'multi-species conserved regions' that they detected overlapped with 63% of the functionally validated regulatory elements in the CFTR genomic region, and that many of the remaining missed known regulatory elements may have been missed either because they are shorter than their approach could detect (< 25 bp), or because they are primate-specific. Interestingly, their results suggest that the power to detect multi-species conserved regions seems to depend mainly on the total divergence of the subset of species rather than on the particular distribution of the species among lineages, and thus that combined phylogenetic branch length may be a useful metric for guiding the selection of additional genomes to sequence.
Francis Collins has said  that further multi-species comparisons, especially those occupying distinct evolutionary positions, will lead to significant refinements in our understanding of the functional importance of conserved sequences and are thus crucial to the functional characterization of the human genome. Sidow  noted that identification of the majority of functional elements relevant to human biology requires placental genomes beyond those of human, mouse, and rat. Sidow commented that "Building a parts list is important, but multiple sequence alignments by themselves do not quantify conservation and allow only limited inference as to which conserved functional element is more constrained than another" .
In recent years, a number of efforts have been focused on attempting to predict TFBSs using structural information on the protein or related protein-DNA complexes. Some of these studies have attempted to determine what 'recognition rules' or 'recognition code' may exist that stipulate which DNA base-pairs are likely to be bound by which amino acids, in the context of a particular structural class of DNA binding proteins. These approaches have come either from analysis of databases of well-characterized DNA-protein interactions [108–112], from computer modeling [113, 114], or from experiments employing in vitro selection from a randomized library, either of the DNA base pairs or the amino-acid residues implicated in sequence-specific binding [115–117]. There is no obvious, simple code like the genetic code, however, and any recognition rules that might exist are likely to be quite degenerate and highly dependent upon the docking arrangement of the protein with its DNA binding site . This area of work, including the possibility of deciphering a 'probabilistic code', is discussed by Benos et al. . Such efforts will be greatly aided by the further development of high-throughput technologies for identifying interactions between TFs and their DNA binding sites, so that much larger datasets can be generated for analyses required to decipher any 'degenerate probability codes' or to be used as training sets for developing improved DNA binding-site prediction algorithms. Similarly, the lack of a sufficient set of TFs of well-characterized DNA-binding specificities has also resulted in the lack of a good test set for the evaluation of new algorithms aimed at predicting transcriptional enhancers.
There are predicted to be around 1,850 TFs in the human genome , but only a very small fraction of them have well-characterized binding specificities. The challenge will be to characterize these specificities, so that their target genes and potential combinatorial modes of transcriptional regulatory control can be discovered. Studies using the various high-throughput technologies described earlier will permit a better understanding of the locations and organization of regulatory DNA elements in higher eukaryotic genomes and the regulatory complexity resulting from combinatorial interactions of TFs. Finally, there is a need for the development of high-throughput transgenic bioassays for validating predicted enhancers, as experimental verification of predicted cis-regulatory elements is currently another major limiting step. The combination of these different kinds of transcription-factor binding-site data, together with mRNA expression analysis, protein-interaction databases and prior genetic and biochemical data in the literature, will allow the construction of more detailed connectivity maps of transcriptional regulatory networks [10, 13, 121–125].
I thank Mike Berger, Anthony Philippakis, and Pete Estep for helpful comments on the manuscript. M.L.B. was supported in part by an Informatics Research Starter Grant from the PhRMA Foundation, a Taplin Award from the John F. and Virginia B. Taplin Foundation, and a Harvard Medical School William F. Milton Fund Award.