Identifying transcriptional targets
© BioMed Central Ltd 2004
Published: 27 February 2004
Identifying the targets of transcription factors is important for understanding cellular processes. We review how targets have previously been isolated and outline new technologies that are being developed to identify novel direct targets, including chromatin immunoprecipitation combined with microarray screening and bioinformatic approaches.
The control of many cellular processes requires the coordinated activation or repression of genes in the correct spatial and temporal patterns. This regulation is carried out in large part by transcription factors, which bind to DNA sequences within chromatin and activate or repress the transcription of nearby genes. This binding is frequently sequence-specific, with sequence recognition being carried out by the transcription factor itself or by other proteins complexed to it. Identification of the targets of each transcription factor provides information about individual processes and how transcription factors interact in a transcriptional network. These networks can then be used to describe a particular cellular process, or even something as complicated as embryonic development [1, 2].
The first step in identifying targets of a transcription factor usually involves overexpression or knockdown of the factor in question and analysis of the resulting changes in gene expression. The development of microarray technology has facilitated this kind of analysis, allowing identification of many more downstream genes than was previously feasible. But this method gives no information about whether targets are regulated directly by the transcription factor through binding to regulatory sequences within the gene or whether regulation is indirect, through the activation of intermediate genes. Other techniques, such as chromatin immunoprecipitation (ChIP) and Dam methylase identification (DamID), have therefore been developed. These reveal where in the genome the transcription factor is bound, and they allow identification of many direct target sequences, particularly when combined with microarrays of genomic DNA. This type of information, in combination with genomic sequences, is now being used to develop computational algorithms that scan genomic sequence with the aim of distinguishing functional binding sites and target genes of transcription factors.
Such approaches have their limitations, however. Overexpression or misexpression of a transcription factor may not lead to up-regulation of its target genes if transcription is tightly controlled, or alternatively it may lead to indiscriminate activation of other genes that are not usually activated by the transcription factor under physiological conditions. On the other hand, knockdown of a transcription factor may cause embryonic or cellular lethality, or there may be redundancy with another factor so that bona fide target genes are not downregulated and therefore may not be identified. Nevertheless, these methods have been used successfully to identify transcription-factor target genes (see, for example, [3, 4]).
Once putative target genes have been identified, they are often verified by examination of their expression pattern in tissues or whole organisms, since direct targets are expected to be activated in the regions where the transcription factor is expressed. Expression of putative target genes can also be compared between wild-type and mutant systems, as targets should not be expressed in the absence of the transcription factor (see  for example).
These methods identify only a limited number of targets, but more recently high-throughput techniques have allowed the identification of many more. Projects for sequencing both genomic DNA and expressed sequence tags (ESTs) have led to the development of expression microarrays, which enable simultaneous screening of most or all of the transcriptome and thus increase the number of targets that can be easily identified through comparisons of mRNA populations. In such experiments, RNA from each of the two cell populations, as described above and in Figure 1, is labeled with a different fluorescent dye. The RNA is then mixed and hybridized to microarrays, consisting of cDNAs or oligonucleotides arrayed on glass slides. The fluorescence intensity of each dot, which corresponds to one gene, can be measured and correlated to a change in expression of each gene. For example, circadian gene expression in Drosophila, which is at least partially controlled by the Clock (Clk) transcription factor, was recently analyzed using microarrays. Comparison of gene expression in wild-type and clk mutant flies led to the identification of 134 genes that require Clk for expression and whose expression levels cycle over 24 hours in wild-type flies.
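As a minimal sketch of how two-dye intensities can be turned into a measure of expression change, the snippet below computes a log2 ratio per spot and splits genes into up- and down-regulated sets. The gene names, intensity values, the intensity floor and the two-fold threshold are all illustrative assumptions, not values taken from the studies discussed here.

```python
import math

def log2_ratios(sample, reference, floor=1.0):
    """Return log2(sample / reference) for every gene present in both channels.

    `floor` is an assumed guard value that avoids taking the log of zero or
    negative background-subtracted intensities.
    """
    ratios = {}
    for gene in sample.keys() & reference.keys():
        s = max(sample[gene], floor)
        r = max(reference[gene], floor)
        ratios[gene] = math.log2(s / r)
    return ratios

def changed_genes(ratios, threshold=1.0):
    """Split genes into (up, down) sets where |log2 ratio| >= threshold."""
    up = {g for g, v in ratios.items() if v >= threshold}
    down = {g for g, v in ratios.items() if v <= -threshold}
    return up, down

# Hypothetical intensities: mutant channel versus wild-type channel.
mutant = {"geneA": 200.0, "geneB": 1600.0, "geneC": 800.0}
wild_type = {"geneA": 800.0, "geneB": 400.0, "geneC": 800.0}

ratios = log2_ratios(mutant, wild_type)
up, down = changed_genes(ratios)
```

In this toy example a gene that requires the factor for expression would appear in `down`, because its signal is lower in the mutant channel. Real analyses also normalize between dyes and use replicate arrays, which this sketch omits.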
In addition to giving increased numbers of potential transcription-factor targets, the ease with which large numbers of genes can now be investigated allows comparison of more than two different conditions, giving a clearer indication about which genes may be direct targets. For example, to identify targets of the sterol-regulatory-element binding protein (SREBP) genes in mice, Horton et al.  compared gene expression in the livers of one knockout strain and two transgenic strains of mice that overexpress different forms of SREBP. They applied stringent combinatorial criteria to identify direct targets, restricting themselves only to genes that were upregulated in both transgenic lines and downregulated in the knockout line. As a result, 33 genes were identified by this method, only 38% of the genes that would have been identified by comparing just two of the strains. Although this combinatorial method clearly increases the likelihood of predicting a direct target, other methods must be used to be more confident of a direct interaction of the transcription factor with the target gene.
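The combinatorial criterion described above (upregulated in every overexpressing line and downregulated in the knockout) amounts to a set intersection. The following sketch illustrates that logic only; the gene identifiers and set contents are hypothetical and the function name is our own.

```python
def combinatorial_targets(up_in_transgenics, down_in_knockout):
    """Return genes that are up in every transgenic line AND down in the knockout."""
    if not up_in_transgenics:
        return set()
    # Genes upregulated in all overexpression lines.
    shared_up = set.intersection(*map(set, up_in_transgenics))
    # Keep only those that are also downregulated when the factor is absent.
    return shared_up & set(down_in_knockout)

# Hypothetical gene sets from three expression comparisons.
tg1_up = {"g1", "g2", "g3", "g4"}
tg2_up = {"g2", "g3", "g4", "g5"}
ko_down = {"g3", "g4", "g6"}

targets = combinatorial_targets([tg1_up, tg2_up], ko_down)
```

Each additional condition can only shrink the candidate list, which is why the stringent filter yields fewer genes than any pairwise comparison would.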
A variety of methods can be used to identify targets that are likely to be regulated directly by a transcription factor. Timing is one criterion: for example, immediate early genes, which are switched on shortly after the activation of a transcription factor, are more likely to be activated directly by that factor, because there has been little time for another gene to be activated and then for that to activate the target gene. This type of analysis is facilitated by the use of inducible gene expression, so the precise moment at which the transcription factor is activated and able to induce expression of downstream genes is known .
This technique can be further improved by the use of protein-synthesis inhibitors, such as cycloheximide. Transcription factors that are already present within the cell are able to activate the expression of their target genes, but in the presence of cycloheximide the target genes cannot be translated, and so cannot switch on further downstream genes as indirect targets. Thus, only those genes upregulated in the presence of cycloheximide are direct targets . For instance, although microarray expression analysis identified 134 targets of Drosophila Clk, expression of a hormone-inducible form of Clk in cell culture in the presence of cycloheximide indicates that only nine of the genes are in fact direct targets .
These methods provide further evidence that a target is direct but do not show that the transcription factor binds directly to a regulatory sequence in the gene; this can be tested by other approaches, such as the electrophoretic mobility shift assay (EMSA). This technique identifies binding of specific proteins to DNA sequences, and so can demonstrate direct binding of a transcription factor to the promoter region of its target gene . This in vitro method may not accurately reflect the situation in vivo, however, as binding is likely to be less tightly regulated in the assay.
One approach, which was first described for Saccharomyces cerevisiae but has since been applied to human cell lines [15, 16, 18–21], has extended the ChIP protocol to the analysis of immunoprecipitated DNA with genomic microarrays (see Figure 2; reviewed in more detail in [22, 23]).
The design of microarrays differs between different research groups and between organisms. For S. cerevisiae, which has a small and relatively simple genome containing approximately 6,200 genes, it is possible to design microarrays containing all yeast intergenic regions [15, 16] in addition to coding regions [15, 24]. Designing microarrays for human studies is more difficult, because higher eukaryotes have a more complex genome and more complex mechanisms of gene regulation. Unlike yeast, where the majority of transcription-factor-binding sites are found in upstream proximal promoter regions [15, 24], higher eukaryotic gene expression is also controlled by factors binding at enhancer sequences located many kilobases from the gene. These enhancers may be situated 5' or 3' relative to the gene, in introns or even occasionally in exons (see below).
Initial studies of transcription-factor binding in human cells concentrated on E2Fs, a family of transcription factors that play a role in cell-cycle progression and proliferation [16, 18]. Thus Ren and colleagues designed arrays containing sequences upstream of 1,444 genes available from the human genome sequence, about 1,200 of which had previously been identified as cell-cycle-regulated. As more human genome sequence data and annotation have become available, however, the Ren and Young labs have now produced microarrays containing 6,000 and 13,000 sequences, again consisting mostly of 5' proximal sequences [21, 25]. A different approach was taken by Weinmann and colleagues, who arrayed 7,776 human genomic fragments enriched for CpG islands, which are generally associated with upstream regulatory regions in vertebrates (reviewed in ).
Although such approaches are very powerful, one drawback of intergenic arrays is that they are biased by the design. In particular, 5' upstream sequence arrays will not detect interactions in introns, downstream sequences, non-annotated genomic regions, or exons. To overcome this bias, another group has designed a microarray containing the non-repetitive sequence of human chromosome 22 . They then used this array in a ChIP assay to analyze binding of the p65 subunit of NF-κB when cells were stimulated with tumor necrosis factor (TNF) α. This approach not only identified new targets for p65 on chromosome 22, but also revealed binding sites in areas of the chromosome that are currently not annotated. Although costly, this technique could be extended to the other chromosomes as more completed human chromosome sequences become available, and in this way an unbiased view of genomic binding-site architecture can be built up.
DNA isolated from DamID experiments has also been used to probe microarrays (Figure 2b). In the first reports of using this technique in Drosophila, cDNA arrays were used [17, 28]. More recently, however, Sun and colleagues have used a microarray spotted with contiguous regions of Drosophila chromosomes 2 and 3 for this analysis , and it should not be long before genomic arrays are also commonplace when using this technique.
One interesting picture that is emerging from these genome-wide location analyses is the pattern of transcription-factor binding across the genome. Several studies have searched for consensus binding sites for a particular factor using bioinformatic approaches (see below), and such sites have been found scattered throughout the genome, in both intergenic and coding regions (see, for example, [15, 24]). Genome-wide location analysis reveals, however, that only a subset of these sites is actually bound in vivo. This could be because binding-site recognition may be influenced by transcription-factor binding partners or by chromatin structure. For instance, when the binding of the yeast transcription factors Swi4 and Rap1 was analyzed using arrays containing both intergenic and coding regions of the genome, most binding sites were found to be in the proximal promoter regions of genes, and very few in coding sequence [15, 24]. In human cells, when binding of p65 was analyzed across chromosome 22, 28% of binding sites were found within 5 kilobases upstream of the translation start codon, 40% were found in intronic regions, and fewer than 1% of sites (2/209) were found in exons. To date, such observations have been made for only a small number of factors and it will be interesting to see how the results for other factors compare.
Ideally, we would like to be able to predict the expression pattern of a gene from its regulatory sequences. Are we moving towards a time when bona fide regulatory sequences bound by transcription factors can be identified in silico? Databases of consensus transcription-factor-binding sites have been assembled over the last decade and computational algorithms that operate ab initio have been developed in an attempt to identify transcription-factor-binding sequences across the genome (see [30–32] for more detailed information). The programs exhibit different levels of stringency depending upon the algorithms used, but because they rely only on sequence data all are subject to false positives and false negatives. This is because transcription factors do not bind to all instances of their consensus binding site, as outlined above, and may also bind to other sequences that vary from the known consensus sequence (see below).
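To make the idea of a purely sequence-based scan concrete, here is a minimal sketch that searches both strands of a DNA sequence for a consensus written in IUPAC nucleotide codes. The motif `GGRA` and the test sequence are invented for illustration and do not correspond to any factor discussed in this review; published tools typically use position weight matrices rather than a simple pattern match.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an upper-case A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def find_sites(sequence, consensus):
    """Return sorted (start, strand) hits, with start given on the forward strand."""
    sequence = sequence.upper()
    core = "".join(IUPAC[base] for base in consensus.upper())
    # A lookahead is used so that overlapping matches are all reported.
    pattern = re.compile("(?=(" + core + "))")
    hits = [(m.start(), "+") for m in pattern.finditer(sequence)]
    n, k = len(sequence), len(consensus)
    for m in pattern.finditer(reverse_complement(sequence)):
        # Convert the reverse-strand coordinate back to the forward strand.
        hits.append((n - m.start() - k, "-"))
    return sorted(set(hits))

# Hypothetical motif and sequence.
sites = find_sites("ACGGAATTTCCGT", "GGRA")
```

Every match is reported regardless of whether the site would be occupied in vivo, which is exactly the source of the false positives described above; a palindromic motif is reported once per strand.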
The development of computational algorithms has been improved by comparative genomics, or phylogenetic footprinting (for example , reviewed in ). This approach is based upon the fact that non-coding sequences that are highly conserved between species are much more likely to be involved in gene regulation. But difficulties arise in identifying organisms that are sufficiently closely related for regulatory regions to be conserved but sufficiently divergent for this conservation to be significant.
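The core of this comparison can be sketched as a sliding-window identity score over two sequences that have already been aligned. The window size, the identity cutoff and the toy alignment below are assumptions chosen for illustration; the alignment step itself is not shown, and real footprinting methods are considerably more sophisticated.

```python
def conserved_windows(aln_a, aln_b, window=10, min_identity=0.8):
    """Return (start, identity) for alignment windows at or above min_identity.

    Both inputs must be rows of the same pairwise alignment, with '-' for gaps.
    Gap columns never count as matches.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    matches = [a == b and a != "-" for a, b in zip(aln_a.upper(), aln_b.upper())]
    hits = []
    for start in range(len(matches) - window + 1):
        identity = sum(matches[start:start + window]) / window
        if identity >= min_identity:
            hits.append((start, identity))
    return hits

# Hypothetical alignment: the first 10 columns are identical, the rest differ.
species_a = "ACGTACGTACTTTTTTTTTT"
species_b = "ACGTACGTACGGGGGGGGGG"
windows = conserved_windows(species_a, species_b, window=10, min_identity=1.0)
```

If the two species are too closely related, nearly every window passes the cutoff and conservation carries little information, which is the difficulty noted above.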
In order to improve the reliability of computational prediction of functional binding sites, other information, often derived from experimental studies, must be included in the analysis (see [31, 32, 34]). A common method involves comparing the promoters of genes co-regulated by a transcription factor to identify conserved motifs. Recently, targets of Dorsal, a transcription factor involved in specifying the dorsoventral axis in Drosophila, were identified using expression microarrays, and subsequent analysis identified up to 40 targets that have the expected restricted expression pattern in the embryo . Examination of the genomic sequence around a subset of these target genes discovered that consensus Dorsal-binding sites generally cluster together, either upstream of the start codon ATG or within introns . A computational algorithm was developed from this information and used to scan the rest of the Drosophila genome, identifying 3 known Dorsal target genes and 15 new putative targets . Two of these targets have been tested and found to exhibit asymmetric expression patterns across the dorsoventral axis, as would be expected for Dorsal target genes (, reviewed in ).
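The observation that binding sites cluster suggests a simple search: given the positions of predicted sites, report regions where several fall within a short stretch of sequence. The sketch below is one hedged way to express that idea; the 400 bp window, the minimum of three sites and the positions are assumptions for illustration and are not the parameters of the published algorithm.

```python
def site_clusters(positions, window=400, min_sites=3):
    """Return (first_site, last_site, n_sites) for groups of clustered sites.

    A cluster is a run of at least `min_sites` site positions that all lie
    within `window` base pairs of the first site in the run.
    """
    pos = sorted(positions)
    clusters = []
    i = 0
    while i < len(pos):
        j = i
        # Extend the run while sites stay within the window of the first site.
        while j < len(pos) and pos[j] - pos[i] <= window:
            j += 1
        if j - i >= min_sites:
            clusters.append((pos[i], pos[j - 1], j - i))
            i = j  # resume the search after this cluster
        else:
            i += 1
    return clusters

# Hypothetical positions (bp) of predicted sites along one genomic region.
predicted = [100, 250, 380, 2000, 5000, 5100, 5150, 5300, 9000]
clusters = site_clusters(predicted)
```

Isolated sites, such as those at 2000 and 9000 in this example, are discarded, which reduces the number of candidate regions that need experimental testing.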
Similarly, Kel et al.  were able to identify composite modules consisting of clusters of binding sites for E2F and other transcription factors that are involved in the regulation of known E2F targets. Examination of these regulatory sequences led to the identification of a range of characteristic motifs in addition to the known binding sites. Using this information, computational methods were then developed to search the promoter regions of cell-cycle-regulated genes. This led to the identification of 29 genes known to be regulated by E2F, plus an additional 313 putative E2F targets that contained the identified upstream regulatory modules. Some of these putative targets have now been confirmed as direct targets by ChIP analysis .
Interestingly, in those ChIP-array studies where it has been examined, a proportion of sequences bound to transcription factors did not contain the known consensus binding site for the transcription factor tested. For example, Iyer et al. found that in S. cerevisiae about half of the targets of the transcription factors MBF and SBF do not contain the consensus binding sites for the factors. In human cells, Ren et al. and Weinmann et al. found that up to 25% of identified E2F targets did not contain the E2F consensus site. Further characterization revealed that some of these target genes are repressed rather than up-regulated by E2F. Although no sequence that is common to these repressing regions has yet been described, applying computational techniques may reveal such a site. Thus, genome-wide location analysis combined with computational analysis may be useful in identifying previously unknown binding sequences for other transcription factors.
The development of high-throughput methods for the identification of direct transcription-factor target genes has led to a large increase in our understanding of combinatorial networks of gene regulation. The combination of genome-wide expression data with genome-wide location analysis constitutes a powerful tool not only in verifying predicted interactions, but also in elucidating transcriptional networks. Simon et al. performed genome-wide location analysis with the nine known cell-cycle activators in yeast and showed that cell-cycle transcriptional control is a connected network. For example, transcriptional regulators that act at one stage of the cycle to up-regulate genes promoting cell-cycle progression also up-regulate the transcription of factors that act during the next stage of the cycle. This group has since extended its analysis to (nearly) all yeast transcription factors. This has identified simple network motifs (the building blocks of a network) that have been used to describe networks controlling, for example, metabolism and the response to mating factor [20, 39]. As these kinds of analyses become more commonplace, we can look forward to a time when each transcription factor can be placed in a network that describes a complex cellular process, such as those that lead to the development of an embryo.
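Once binding data are expressed as regulator-to-target pairs, network motifs can be found by simple graph searches. As an illustration, the sketch below looks for feed-forward loops (X regulates Y, and both X and Y regulate Z), a commonly described motif type chosen here as an assumption; the text above does not specify which motifs were reported, and the factor and gene names are hypothetical.

```python
def feed_forward_loops(edges):
    """Return sorted (X, Y, Z) triples where X->Y, Y->Z and X->Z all exist."""
    targets = {}
    for regulator, target in edges:
        targets.setdefault(regulator, set()).add(target)
    loops = []
    for x, x_targets in targets.items():
        for y in x_targets:
            if y == x:
                continue  # ignore autoregulation
            for z in targets.get(y, ()):
                if z not in (x, y) and z in x_targets:
                    loops.append((x, y, z))
    return sorted(loops)

# Hypothetical regulator -> target pairs, as might come from location analysis.
edges = [
    ("TF1", "TF2"),
    ("TF2", "geneA"),
    ("TF1", "geneA"),
    ("TF2", "geneB"),
    ("TF3", "TF3"),
]
loops = feed_forward_loops(edges)
```

The same dictionary of regulator targets could be queried for other patterns, such as chains of regulators acting at successive stages of the cell cycle.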