Prediction of synergistic transcription factors by function conservation

A new strategy is proposed for identifying synergistic transcription factors by function conservation, leading to the identification of 51 homotypic transcription-factor combinations.


Background
The expression of genes is regulated by transcription factors (TFs), which interact with the basic transcription machinery to activate or repress transcription after binding to TF binding sites (TFBSs; also called cis-acting elements) in target genes and interacting with other DNA binding proteins. In eukaryotic organisms, transcriptional regulation of a gene's spatial, temporal, and expression level is generally mediated by multiple TFs [1][2][3]. Therefore, the identification of synergistic TFs and the elucidation of relationships among them are of great importance for understanding combinatorial transcriptional regulation and gene regulatory networks.
Currently, the identification of synergistic TFs comes predominantly from two general approaches. The first is by the use of experimental data such as gene expression, chromatin immunoprecipitation (ChIP)-chip, and protein-protein interaction data. For this approach, the majority of studies analyzed gene expression data across a variety of experimental conditions to infer synergistic relationships between TFs [4][5][6][7][8][9]. Statistically significant motif combinations are predicted based on stronger co-expression patterns regulated by two or more TFs than the expression patterns regulated by a single one. With the advance in protein-DNA binding assays [10], researchers have also integrated ChIP data with microarray expression or protein-protein interaction data to infer synergistic binding of cooperative TFs [11][12][13][14]. The second commonly used approach is computational identification of TF combinations. In this case, synergistic TFs were predicted by either enrichment analysis of co-occurring TFBSs on the upstream sequences of genes relative to appropriate background sequences or by comparative genomics using phylogenetically conserved sequences between closely related species [15][16][17].
Despite the success of these approaches, both have limitations. The approach based on experimental observation needs a priori knowledge, such as gene expression patterns in a certain tissue, which restricts synergistic TF determination to those tissues or cells studied and thus prevents the discovery of TF combinations from multiple biological conditions. Conversely, computational approaches can predict TF combinations on a large scale, but they usually lack the ability to functionally annotate synergistic TFs. Furthermore, methods based on phylogenetically conserved sequences, although they can greatly reduce the false prediction rate [18], have limitations related to missing potentially significant observations. Moreover, if the species are very closely related, nonfunctional sequences may not have diverged enough to allow functional sequence motifs to be identified; conversely, if the species are distantly related, short conserved regions may be masked by nonfunctional background sequences.
In the current study, rather than utilizing these traditional approaches, we propose a novel strategy to identify TF combinations by function conservation, which can be implemented at two levels. The first is functional conservation of TFs between species. Based on the strong possibility that each specific TF plays the same role in regulating gene expression between closely related species, the occurrence of its binding sites is expected to be more highly enriched in promoter sequences of orthologous genes than in promoter sequences of non-orthologous genes. The second is functional conservation of TFBSs between promoter sequences of individual orthologous genes. For identifying TF combinations, the general pattern of TFBS arrangement on promoters of orthologous genes is most likely more important than the precise positions of the binding sites [19]. To apply these concepts to synergistic TF discovery, it is important to develop appropriate computational approaches that are able to integrate function conservation from both TFs and TFBSs with analytical methodologies. We thus utilized human and mouse orthologous promoter sequences to first enrich TFBS combinations with distance constraints on a genome-scale and subsequently performed enrichment analyses of common orthologous genes (that is, genes that overlapped between mouse and human with particular homotypic TFBSs) whose regulatory sequences contain the identified TFBS combinations. We then integrated the function conservation from both levels by using Pearson correlation coefficients.
Genome-wide promoter analyses have led not only to the development of computational approaches but also to the identification of 51 homotypic TF combinations using known TFBSs from precompiled position weight matrices (PWMs) in the TRANSFAC database [20]. As a first step toward discovering functional TF networks, we have further used the developed computational approaches to predicate interactions between heterotypic TFs (that is, two different TFs). The strength of this proposed strategy, as opposed to the other described methods, lies in the fact that this strategy does not depend on sequence alignment, but rather genome information, for the discovery of functionally conserved TF combinations. Therefore, TF combinations with different functions can be obtained simultaneously, which is a key first step towards identifying functional TF networks.

Strategy overview
The overall analysis procedures are shown in Figure 1. The input data comprised more than 10,000 human and mouse orthologous promoter sequence pairs from the Database of Transcriptional Start Sites (DBTSS) [21]. To incorporate functional conservation of TFBS combinations into the analysis, we first performed a genome-wide search to obtain all potential TFBSs for each individual promoter sequence using the Match ® program and 234 unique PWMs from the professional TRANSFAC 9.1 database [22]. We then employed distance constraints to select co-occurring TFBSs in individual promoter sequences. The degree of enrichment for TFBS combinations was computed and represented as LOD co scores, which represent the frequency of co-occurrence for particular TFBSs in promoter sequences with respect to random expectation for the co-occurrence of the same TFBSs (see Materials and methods). The assumption behind this enrichment analysis is that random co-occurrence of TFBSs has less or no distance constraint when compared to functional TFBSs, although specific distance constraints may vary for different TFBSs. To incorporate functional conservation of TFs into the analysis, we estimated the degree of enrichment by using the hypergeometric distribution, which was represented as LOD og scores, for overlapping human and mouse orthologous genes whose promoter sequences contained the enriched TFBS combinations.
The integration of function conservation from both levels was achieved by the estimation of correlation between LOD co and LOD og . We hypothesized that if the enriched TFBS combinations had functional significance, then the enrichment of common orthologous genes would correlate with the LOD co scores from both human and mouse promoter sequences, since functional TFBSs are expected to be highly conserved between orthologous gene promoter sequences from closely related species. The degree of correlation would therefore allow us to identify combinatorial TFs that potentially regulated genes in a synergistic fashion. For the selection of significant correlations, we performed permutation tests to obtain p values, which were used to set up filtering criteria for multiple tests. Functional TF combinations were predicted based on p value cutoff threshold (q-value < 0.05) in both human and mouse and further validated by both known TF-TF interactions and function coherence based on Gene Ontology (GO) annotation of common mouse and human genes containing co-occurring TFBSs [23,24].

Enrichment of TFBS combinations and orthologous genes containing the binding sites
For the enrichment of functionally co-occurring TFBSs, we first employed 234 PWMs, which represent unique TFs in the TRANSFAC 9.1 database, to identify homotypic TFBS combinations (that is, two or more binding sites for the same TF on the same gene). As one of the important components of the approach, a total of 18 between-TFBS distances were defined and used to obtain co-occurring TFBSs from individual promoter sequences. Enrichment of TFBS combinations was estimated on a genome-scale by comparing co-occurring TFBS frequencies in known promoter sequences to those from random background sequences. Figure 2a,b show the overall enrichment results of TFBS combinations for 9 selected distance constraints and one without distance constraints from all 234 PWMs. A LOD co score > 0 exemplifies a higher frequency of TFBSs per promoter sequence when compared to background sequences. Thus, the larger the LOD co score, the greater the enrichment of a particular TFBS. The results show that the distributions of LOD co scores obtained from orthologous human and mouse promoters have similar patterns. Whereas the distribution of LOD co scores from the no distance constraint situation is significantly shifted in isolation to the left, LOD co score distributions from distance constraints are shifted to the right along with the smaller between-TFBS distances. Similar results were also obtained for enrichment of common orthologous genes containing the identified TFBS combinations, as can be seen from the LOD og distributions in Figure 2c.
We also performed further analyses to test the statistical significances of LOD co and LOD og score distributions from individual distance constraints using Wilcoxon signed-rank tests.
The results indicated that both median LOD co and LOD og scores from individual distance constraints were significantly larger than those from no distance constraint (p < 10 -15 ), further confirming the enrichment of co-occurring TFBSs and of common orthologous genes. It is important to note that median LOD co scores from individual distance constraints increase along with smaller between-TFBS distance ( Figure  2d), with p values ranging from 2 × 10 -4 to 2 × 10 -16 for human and from 5 × 10 -8 to 2 × 10 -16 for mouse. These findings are, however, not observed for LOD og scores (Figure 2d), for which no significant p values exist from the comparisons between adjacent distance constraints. These results suggest that not all enriched TFBSs represent functional TFBS combinations and, further, that synergistic interactions may not be applicable to every homotypic TF combination.

Integrating function conservation to identify TFs having synergistic interactions
Since functional co-occurrence may not be applicable to every TF and not all enriched TFBSs are functional TFBS combinations, it is therefore important to integrate function conservation from different levels to predict TFs that have synergistic interactions. We employed Pearson correlation coefficients to determine whether the 19 LOD co scores and their corresponding LOD og scores for each individual TFBS correlated with each other. Since functional TFBSs are highly conserved between orthologous gene promoters, we expect that a higher rate of overlapping orthologous genes whose promoters contain the co-occurring TFBSs indicates that the enriched cooccurring TFBSs represent functional ones from individual distance constraints. Therefore, correlations detect the agreement between TFBS enrichment and orthologous gene enrichment, no matter whether all the enriched TFBSs are functional ones or not. Figure 3a,b shows the overall distributions and frequencies of correlation coefficients from all 234 TFBSs for human and mouse. While correlation coefficients cover a broad range from -0.84 to 0.99, only a small portion of TFBSs display strong correlations for their LOD co and LOD og scores, which is in agreement with the conclusion from enrichment analyses.
To estimate the statistical significance of the correlations, we performed permutation tests using randomly paired LOD co with LOD og scores for each TFBS and utilized the resulting p values to set up a cutoff threshold for multiple analyses. Using a threshold q-value < 0.05, we were able to identify 51 homotypic TF combinations (Table 1) from both human and mouse, with p values ranging from 3 × 10 -3 to < 10 -4 and correlations ranging from 0.35 to 0.98. Of these 51 TF combinations, some have relatively smaller correlations when compared to the remaining TFBSs that were not selected because they did not meet the established threshold criteria (Figure 3a,b). This is because TFBSs with similar LOD co and LOD og trends along distance constraints have smaller p values from permutation tests; this trend is nicely illustrated by two TFs, one that met the threshold criterion (E2F1; Figure  4a) and one that did not (MYOGENIN; Figure 4b). Although correlations for E2F1 are smaller than those for MYOGENIN, E2F1 p values are much more significant than those from MYOGENIN. Closer observation indicates that LOD co and LOD og scores for E2F1 show a similar trend (Figure 4a), with both increasing as between-TFBS distance decreases. By contrast, LOD co and LOD og scores for MYOGENIN do not show this trend (Figure 4b), resulting in less statistical significance, even though LOD co and LOD og scores are highly correlated.
Further investigation indicated that overlapping human and mouse orthologous genes whose promoters contain predicted MYOGENIN binding site combinations had no functional association with MYOGENIN regulated genes based on GO analysis (see below), suggesting that MYOGENIN may not be a functional pair.

Evaluation by known TF-TF interactions
To assess the validity of the predicted TF combinations, we first used TRANSCompel ® professional version 10.4 to determine if known TF combinations were statistically enriched in the 51 identified TFs [23]. The TRANSCompel ® database contains approximately 180 experimentally proven composite elements of two or more binding sites; of these approximately 180 composite elements, 15 are synergistic combinations of homotypic TFs. Interestingly, 7 of these known combinations are in the 51 selected TFs, including CEBPB, CREB, E2F1, HNF1, HNF3B, OCT1, and PIT. To estimate the degree of enrichment, we performed a Fisher's exact test comparing the occurrence of known TF combinations in the 51 identified TFs to all 234 TFs. The results indicated that known TF combinations were significantly enriched in the 51 selected TFs (p = 0.035) compared to those TFs that did not meet the selection criterion (p = 0.59). These results indicated that our approach was able to identify to a great extent functionally co-occurring TFs, which exemplifies the validity of our methods.

Evaluation by function coherence
It is a well-established fact that TFs control cellular biological processes by targeting groups of genes encoding proteins with similar functions. Based on this fact, we performed function coherence analyses to determine if genes whose promoter sequences contained the co-occurring TFBSs had known biological functions associated with the TF predicted to bind to them. Two of the 51 selected TFs, namely E2F1 and NFAT, are of particular interest, as the genes that they regulate have well established physiological roles by previous studies. E2F1 regulates cell cycle progression via transcriptional regulation of Distribution of LOD co and LOD og from different distance constraints Figure 2 Distribution of LOD co and LOD og from different distance constraints. (a) LOD co distribution of 234 TFBSs from 9 selected distance constrains (for example, D20 stands for between-TFBS distance of 20 bp) and the one without a distance constraint (None) for human (hs). (b) LOD co distribution of 234 TFBSs from 9 selected distance constraints and the one without a distance constraint for mouse (mm). (c) LOD og distribution of 234 TFBSs from 9 selected distance constraints and the one without a distance constraint. (d) Median LOD co scores for both human (hs_LOD co ) and mouse (mm_LOD co ) and median LOD og scores.
Accordingly, we examined the functional association of genes with predicted synergistic TFBSs by looking for similar enriched GO biological process categories in overlapping human and mouse orthologous genes. This was done by first identifying the statistically over-represented GO biological process categories for genes whose promoter sequences contained the co-occurring TFBSs by DAVID [34], followed by looking for common GO biological process categories between human and mouse genes from the same distance constraint. Notably, genes whose promoter sequences contain co-occurring TFBSs display strong function coherence to the corresponding TFs binding to them, as shown in Table 2, in which enriched GO biological process categories and their p values from Fisher's exact tests are listed from eight distance constraints. These results indicate that identified genes with co-occurring E2F1 binding sites are involved in cell cycle control, sterol metabolism, and nucleotide and nucleic acid metabolism; notably, the biological process of cell cycle is over-represented at most distance constraints tested. In the case of NFAT, major over-represented biological functions include homophilic cell adhesion and immune response. As mentioned above, immune response is directly controlled by NFAT transcription factor. Overall, these results provide strong evidence for the functional co-occurrence of the identified TFs, and again exemplify the validity of our novel approaches.

Function annotation for the identified synergistic TFs
As mentioned above, it is well known that TFs control cellular biological processes via transcriptional regulation of groups of genes with similar functions. The roles of a particular TF in cellular processes can, therefore, be deduced from the known physiological functions of the TF's target genes.
To perform function annotation, while minimizing false positives, we first sought to identify distance constraints that had significant correlations between LOD co and LOD og scores from the 51 identified TFs. Accordingly, 10,000 random correlations were computed for each distance constraint using permuted LOD co and LOD og scores from the 51 TFs and used to estimate the statistical significance for real correlations. Correlations from human promoter analyses displayed significance for between-TFBS distances of 20 bp up to 90 bp, with p values ranging from 0.044 to 0.006 ( Figure 5). Although correlations from mouse promoter analyses were not significant, 8 distance constraints (between-TFBS distances of 20 and 90 bp for human and mouse) were nevertheless used for function annotation. Common human and mouse orthologous genes containing the synergistic binding sites were subsequently submitted to DAVID for GO analysis. The selection of biological process categories for TF function annotation was based on the following criteria: biological process categories are in common in at least five distance constraints between human and mouse; there exist at least five distance constraints in both human and mouse whose p values for the common biological process are less than 0.05.
Function annotation results are shown in Table 3, where potential biological functions for 38 synergistic TFs are listed (significant categories were not detected from the other 13 TFs). A brief search of PubMed revealed that annotated biological functions for at least 18 of these 38 synergistic TFs are in good agreement with previously reported findings by others . For example, earlier findings indicated that HNF1 was involved in the regulation of the expression of  Table 1 Correlations human organic anion transporter 3 [39], IRF in antiviral defense and immune activation [40], NRF2 in mammalian mitochondrial biogenesis [42], and Zic3 in neurogenesis [55]. All these results provide further evidence in support of our novel approaches.

Functional conservation of TFBSs obviates problems associated with phylogenetic footprinting
Unlike phylogenetic footprinting, which searches for conserved TFBSs between individual orthologous genes by sequence alignment, our approach of enriching TFBS combinations involved first obtaining all potential TFBSs on a genome-scale and then looking for TFBS combinations based on the pattern of binding site arrangement on promoter sequences. For homotypic TF combinations, the pattern might include the number of TFBSs and relative between-TFBS distance. Utilizing this approach, conserved TFBS combinations that are located in different positions on the promoter sequences of orthologous genes can be identified, thus eliminating problems caused by low sequence similarity and sequence insertion or deletion.
Detailed analysis of identified E2F1 binding sites exemplifies this point; Figure 6a shows conservation of putative E2F1 binding sites on human and mouse promoter sequences from a between-TFBS distance of 20 bp. Overall, the arrangement of TFBSs is highly conserved (with the exception of one extra binding site in the mouse STAG1 gene). In some genes, such as the TSPAN14 and FBN2 genes, E2F1 binding sites are in exactly the same position on mouse and human promoters, while in other genes, such as the E2F1 and YY1 genes, the TFBS pairs are in vastly different locations. We thus hypothesize that it is very unlikely that traditional approaches like phylogenetic footprinting could identify all of these putative synergistic TF interactions. To test this hypothesis, we used the rVista program to perform phylogenetic footprinting to search for conserved binding sites between human and mouse promoter sequences for all genes [57]. Notably, although phylogenetic footprinting detected synergistic TFBSs for four genes, it missed the remainder (Figure 6a). It is important to note that the majority of these genes are regu-lated by E2F1, as their promoters were experimentally proven to be bound by E2F1. These genes include STAG1 [58], YY1 [58], CDCA7L [58], RNF167 [58], FBN2 [58], NULP1 [58], DTNB [58], MYBL2 [59], and E2F1 [25,60]. Phylogenetic footprinting was not able to detect any E2F1 binding site in promoters of the E2F1, STAG1, YY1, CDCA7L, and NULP1 genes.
To investigate whether the predicted combinatorial TFBSs that were not detected by phylogenetic footprinting are truly functional ones, we searched for genes whose promoters have experimentally proven synergistic E2F1 binding sites. One promoter of the above five genes, the E2F1 promoter, was well-characterized from both human and mouse to be synergistically bound by E2F1 (representing a self-regulatory loop) [25,60]. Sequence comparisons of both E2F1 binding sites (Figure 6b), as well as the entire promoter sequences from both human and mouse, indicated that our predicted E2F1 binding sites were exactly those experimentally proven, functional E2F1 binding site combinations on E2F1 promoters. This observation suggests that other predicted synergistic E2F1 binding sites, without experimental evidence, likely represent functionally conserved elements, despite the fact that the relative locations of binding sites might vary between species. A good example is the ACVR1 gene with two E2F1 binding site clusters containing a total of five putative binding sites, which show a similar arrangement between orthologous genes but are located at different positions on the promoter. A closer examination demonstrates that these two clusters are highly conserved in regards to both nucleotide sequence and spacing between each binding site within each cluster ( Figure  6b), suggesting that they are indeed functionally conserved. Importantly, phylogenetic footprinting detected only one of the five E2F1 binding sites in the ACVR1 gene.

Quantitative comparisons of function conservation with other methods
The above results indicated that our approach was able to identify more truly functional TFBSs than phylogenetic footprinting. We also performed further studies to make quantitative comparisons of our function conservation method with Correlations between 19 LOD co and their corresponding LOD og scores for each of 51 homotypic TF combinations are listed. Also listed are the statistical significances of the correlations computed from permutation tests using randomly paired LOD co with LOD og scores. phylogenetic footprinting and the enhancer element locator (EEL) algorithm [61]; the latter also employs distance constraints to help identify interacting TFs. To facilitate this analysis, we obtained a set of 6,183 human genes whose promoters were experimentally proven to be bound by E2F1 within 1 kb upstream of the TSS in HeLa cells [58]. Out of these promoters, 1,591 (Additional data file 1) are in the promoter list of human genes used in this study and have at least one E2F1 binding site (PWM: E2F1_Q3_01). We first sought to obtain promoters with combinatorial E2F1 binding sites with given distance constraints. We subsequently computed the conditional probability that synergistic E2F1 binding sites are spaced in a given distance constraint, given E2F1 binding sites in these E2F1 target promoters. This conditional probability, as measured from real promoters, was then compared to those measured from promoters with shuffled sequences and used to compute the statistical significance for each individual distance constraint. We observed significance of E2F1 synergy for distance constraints from 10 bp to 600 bp (p values from 5 × 10 -4 to 6 × 10 -34 with q-value < 0.001) and obtained the corresponding promoters (Table 4).
Using these human gene promoters with combinatorial E2F1 binding sites (see Additional data file 2), we next assessed the sensitivity and specificity for detecting synergistic E2F1 combinations by function conservation and phylogenetic footprinting. In this analysis, real promoters were used as true positives and the corresponding randomized promoters with shuffled nucleotides as true negatives. Sensitivity (the fraction of promoters that were identified to have combinatorial E2F1 binding sites) was defined as the proportion of true positives over combined true positives and false negatives, and specificity as the proportion of true negatives over combined true negatives and false positives, the latter being the fraction of randomized promoters that were identified to have synergistic E2F1 combinations. We applied our function conservation, phylogenetic footprinting (using rVista), and EEL (stand-alone version for pairwise analysis) to the selected human and their corresponding mouse orthologous gene promoters. Results indicated that our function conservation approach had much higher sensitivity (approximately tenfold) than phylogenetic footprinting for all distance constraints tested, as shown in Table 4. On the other hand, both approaches had equally excellent specificity with no false positives detected using three sets of shuffled promoter sequences. We were also curious to know the sensitivity of detecting promoters with any number of conserved E2F1 binding sites by phylogenetic footprinting. Notably, although phylogenetic footprinting was able to detect 12.9% of the 575 E2F1 target human promoters with one or more E2F1 binding sites, the positive rate was still much lower than those from our function conservation approach (20.3%) for combinatorial TFBS detection.
Results of this analysis further indicated that the EEL algorithm was able to detect conserved pairs or clusters of E2F1 sites in only 9 of the 575 target human promoters, demonstrating a much lower sensitivity (1.6%) than our function conservation approach. It is interesting to note that EEL detected multiple single E2F1 sites in many target human promoters. Although these E2F1 sites may not be conserved ones based on the underlying premise of the EEL algorithm, we nonetheless manually calculated all possible combinations of E2F1 sites for each target promoter. The overestimated positive rates are listed in Table 4, where EEL still displays much lower sensitivity (approximately 0.5-fold) than our function conservation approach for all distance constraints tested. Furthermore, false positives were detected by Distribution of LOD scores for selected TFBSs from all distance constraints Figure 4 Distribution of LOD scores for selected TFBSs from all distance constraints. (a) LOD co scores of both human (hs_LOD co ) and mouse (mm_LOD co ) and LOD og scores for E2F1. Also shown are the correlations of LOD og and LOD co for human (R hs ) and mouse (R mm ) and corresponding p values. (b) LOD co scores of both human (hs_LOD co ) and mouse (mm_LOD co ) and LOD og scores for MYOGENIN. Also shown are the correlations of LOD og and LOD co for human (R hs ) and mouse (R mm ) and corresponding p values.  the EEL algorithm using three sets of shuffled promoter sequences (Table 4), indicating lower specificity for EEL. Taken together, these results indicated that our approach was able to identify conserved TFBS combinations to a much greater extent than phylogenetic footprinting and the EEL algorithm.

Prediction of heterotypic TF interactions and TF-TF interaction networks
In an effort to expand our analyses to a more complex, perhaps more physiologically relevant situation, we applied our novel approaches to identify potential heterotypic TF combinations using the selected 51 TFs; a total of 1,275 TF combinations was considered. Correlations between LOD co and LOD og scores for these TF combinations had similar distributions to those from homotypic TF combinations, ranging from -0.96 to 0.99 for human and from -0.94 to 0.99 for mouse (data not shown). Statistical significance of the correlations estimated from permutation tests indicated that the number of significant TF combinations was overwhelming, even when a highly stringent cutoff threshold was applied (q-value < 0.01). Nevertheless, we selected 78 heterotypic TF combinations that passed the cutoff threshold and also had high correlations in both human and mouse (R > 0.85) for further TF interaction analysis. We felt that this combination of using a low q-value along with a high correlation coefficient would allow us to minimize false positive predictions.
As shown in Figure 7, TF-TF interaction networks, utilizing these 78 heterotypic TF pairs constructed by the InterViewer program [62], segregated into two clusters, with a single link between PIT1 and CETS1P54. Thirty-six TFs were predicted to have significant interactions with one another and are represented in the figure. The smaller cluster consists of 19 synergistic TF pairs from 15 TFs, while the remaining 58 synergistic TF pairs constitute the larger cluster from 21 individual TFs. It is important to note that the binding sites for the TFs in the smaller cluster are almost all AT-enriched, while conversely, those in the larger cluster are mostly GCenriched. A closer look at the smaller cluster reveals that HNF3B, FOX and FREAC7 (also called FOXL1), all members of the forkhead box family of transcription factors ((( [63], are directly coupled to each other, suggesting that these TFs from the same family may function in a synergistic fashion. We have also performed a PubMed search to determine if SP1 is known to physically interact with any of the 13 factors to which it is connected by this analysis; we chose to delve into SP1 further as it is one of the first TFs discovered and has been highly studied. We found experimental evidence that SP1 physically interacts with EGR [64], CP2 [65], MZF1 [66], MAZ [67], PU1 [68], CETS1P54 (ETS-1) [69], E2F1 [70], ETF (TEF-1) [71], and KROX [72]. While SP1 binds to the GC-box elements and ZF5 is the analogous factor of SP1 [73,74], a couple of the factors could not be identified in PubMed by the gene names listed in Figure 7, so their interactions with SP1 are unknown. Overall, these data suggest that this approach is valid for identifying heterotypic TF interactions.

Discussion
The identification of combinatorial TFs and the elucidation of relationships among them are of great importance for understanding transcriptional regulation as well as TF networks. Previous approaches employed for the identification of functional TF combinations are based on either TFBS enrichment from co-regulated genes or phylogenetic footprinting. Although both approaches have proven to be successful, they each have limitations. To explore alternative approaches, we propose a new strategy to look for synergistic TFs by function conservation, which was implemented from functional conservation of TFs between species and corresponding TFBSs between orthologous genes.
Although prior to our study there had not been a genomewide function-based approach for the prediction of combinatorial TFs, several previous studies employed distance constraints to help identify interacting TFs [9,16,61], including the EEL algorithm [61]. A side-by-side comparison of the function conservation approach versus EEL is included in Table 5. It is important to point out that the core approaches for computing functional conservation at both levels were based on genome-scale information of orthologous genes, from which both TFBS combinations and common orthologous genes whose regulatory sequences contain the identified TFBS combinations were enriched. Integration of function conservation from both levels by correlation analysis led to the final prediction of combinatorial TFs. While GO analysis could provide functional annotation for the predicted TF combinations, it was employed mainly as part of our procedures to validate our approaches. Therefore, our validated approaches are readily applicable to other genomes for the prediction of physically interacting TFs and TF net-

Function annotation for 51 homotypic TF combinations
works. This is likely the case as long as relative complete genes and promoter sequences are available for these genomes so that orthologous genes between closely related species can be determined correctly by pairwise alignment and cluster analysis.
The enrichment of functional TFBSs plays an important role in the integration of function conservation from different levels by correlation analyses, as the functional conservation of TFs is based on genes whose promoter sequences contain over-represented binding sites. In this study, TFBS enrichment was achieved by first obtaining all potential TFBSs on a genome-scale, followed by searching for TFBS combinations by using distance constraints. Although this method was successful, other methods can also be used as long as they are able to identify over-represented TFBSs or reliably distinguish functional from non-functional TFBSs. Therefore, the strategy of function conservation is not limited to synergistic TF discovery, but is applicable to single TFs and even transcriptional regulatory modules.
Functionally conserved E2F1 binding sites in human and mouse genes Figure 6 Functionally conserved E2F1 binding sites in human and mouse genes. (a) Schematic alignment of functionally conserved E2F1 binding sites between human (hs) and mouse (mm) promoter sequences from between-TFBS distance of 20 bp. Also listed are the numbers of conserved E2F1 binding site(s) detected by phylogenetic footprinting (PF). Asterisks indicate promoters of genes with experimentally proven E2F1 binding sites. (b) Sequence alignment of synergistic E2F1 binding sites from the E2F1 gene and two E2F1 binding site clusters from the ACVR1 gene. Core motifs are shown in upper case letters, and the distances between adjacent binding sites are shown in brackets. Also shown are the locations of each binding site in relation to the transcription start site.
A related question is if there exist optimal distance constraints for all TFBSs, which would result in significant and optimal correlations between LOD co and LOD og scores for individual distance constraints. We reasoned that an optimal distance constraint is most likely to have not only a significantly large correlation but also a higher correlation than those from its two adjacent distance constraints. Interestingly, although LOD og and LOD co scores displayed general patterns of increasing correlations along with smaller between-TFBS distance (data not shown), none of these cor-relations were statistically significant (p > 0.3), based on random correlations from 10,000 permuted LOD co and LOD og scores from the same distance constraint. Although no optimal distance constraints were found for all TFBSs, we noticed that correlations between TFBS enrichment and common orthologous gene enrichment from those 51 selected TFBS combinations displayed significance for some, but not all, distance constraints ( Figure 5). These results indicated that optimal distance constraints, if any, might vary among different TFBSs.
The incorporation of functional conservation of TFs can provide further stringency for synergistic TF discovery, since computational methods for enriching or characterizing functional TFBSs are likely to contain false predictions. In this study, the analyses of concordance of TFBS enrichment with overlapping orthologous gene enrichment are clearly appropriate for identifying functional TF combinations, as correlations detect the agreement between TFBS enrichment and orthologous gene enrichment, no matter whether all the enriched TFBSs are functional ones or not. We thus employed correlation analyses to achieve integration of function conservation from both TF and TFBS levels. Although correlation between LOD co and LOD og scores from each individual TFBS would provide a direct measurement for its functional co-  occurrence, it did not necessarily reflect the degree of similar changing trends between LOD co and LOD og scores, which was especially important for the detection of functionally cooccurring TFBSs, as not all enriched TFBSs truly represent functional interactions. We also found that permutation tests for estimating the statistical significance of correlations is appropriate for not only setting up the cutoff threshold for the selection of the most statistically significant correlations from multiple analyses, but also for selecting TFBSs with similar trends between LOD co and LOD og scores.
The validity of functional conservation of TFBSs was also assessed by experiments from our previous investigations. In a study to identify common transcriptional regulatory elements in a group of interleukin-17 target genes [75], we employed a similar approach to obtain all potential TFBSs and looked for conserved TFBSs based on the pattern of TFBS arrangement on promoter sequences. TFs NF-κB and C/EBP were found to mediate combinatorial regulation for interleukin-17 target genes, which was validated by both luciferase reporter gene and gel shift assays. These observations provide direct proof that our computational methods can predict functional TFBSs.
In another investigation, we utilized a similar approach to identify over-represented TFBSs in a group of genes induced in the intestine of iron-deficient rats [76]. Microarray and clustering analyses led to the identification of 228 upregulated genes during iron-deficiency across several stages of postnatal development. We pulled promoter sequences for these genes in rats, mice and humans and performed enrichment analysis for TFBSs. Our results indicated that SP1-like binding sites were significantly over-represented in our experimental genes in all three species compared to random background sequences. Our analyses also predicted that SP1or a related TF could function in a synergistic fashion with the FOX TF to regulate some of our identified genes. We then inspected the promoter sequences of genes containing the predicted binding sites for the location and sequence of the TFBSs. Strikingly, we found that the SP1 and FOX binding sites were highly conserved in many of these genes; in some cases, the sequence and location in the promoters across species was identical while in other cases, the sequence and spacing were conserved, but the relative location of the binding sites was variable. These observations again emphasize the point that traditional methods such as phylogenetic footprinting are likely to miss important, conserved elements, and further that our novel approaches are a Topology of TF-TF interaction network valid alternative method that should prove useful for functional TFBS prediction.

Conclusion
Previous methods employed for the identification of functional TF combinations are based on either TFBS enrichment from co-regulated genes or phylogenetic footprinting. In this study, we have proposed the use of function conservation for the identification of TF combinations and have developed computational approaches to implement this novel strategy. These approaches include functional TFBS enrichment based on the pattern of binding site arrangement on promoters of orthologous genes by distance constraint, enrichment of overlapping orthologous genes whose regulatory sequences contain the enriched TFBS combinations, and integration of function conservation from both TF and TFBS levels by correlation analyses. To assess the usefulness of these approaches, we have applied them to human and mouse orthologous promoter sequences. Genome-wide promoter analyses have led to the identification of 51 homotypic TF combinations, which were validated by both known TF-TF interactions and function coherence analyses. The main advantage of our strategy lies in the fact that it does not depend on sequence alignment but rather genome information, thus providing an alternative method for functional TF discovery and annotation. We have further compared our approach directly with phylogenetic footprinting and clearly demonstrated that function conservation analysis was better able to predict synergistic, functional TFBSs. Overall, this study has introduced a novel approach for the discovery of functional TFBSs, which should be applicable to any species for which genome information is available.

Promoter sequence sources
Orthologous genes between human and mouse were obtained from the 'relational table between human and mouse mutu-ally orthologous genes' in DBTSS, which contains approximately 12,020 orthologous gene pairs. Out of these orthologous genes, 10,046 have defined promoter sequences for both human and mouse. Alternative promoter 1 was used for all genes, as alternative promoter 1 is the most upstream promoter in the case of multiple alternative promoters for a gene in DBTSS. Promoter sequences within 1 kb upstream of the annotated TSS of the 10,046 genes for both human (Additional data file 3) and mouse (Additional data file 4) were extracted from DBTSS and used for TFBS detection with the repeat sequences unmasked.

TFBS detection
To perform the measurement for significant co-occurrence of TFBSs, an appropriate randomly generated background model is necessary. A Perl script was written to shuffle the DNA sequences within each promoter to create background sequences that have the same nucleotide content as the original promoter sequences. These background sequences are preferable to using intergenic sequences, which usually are AT-rich or exonic sequences whose nucleotide distributions tend to be biased. The resulting two sets of shuffled sequences from human and mouse, together with the original two sets of promoter sequences, were used for TFBS detection. To provide important information for estimating statistical significance for single gene pairwise comparisons, multiple shuffled sequences as background are usually employed. In our case, however, multiple sets of shuffled promoter sequences as background are not necessary, since no direct pairwise gene comparisons were performed; instead, comparisons were performed on a genome-scale using all shuffled sequences as background. The Match ® program [22], for which the profile parameter was set to be 'minimize the sum of false positives and negatives', was employed to conduct searches for TFBSs using vertebrate position weight matrices from the professional TRANSFAC 9.1 database. To eliminate redundant PWMs for the same TF (prefix of the 'Identifier'), the one with largest TFBS enrichment (see next section) or with highest quality control for building the PWM was selected, which resulted in 234 PWMs (Additional data file 5) for unique factors in this study.

Enrichment of TFBS combinations by distance constraints
The co-occurrence of TFBSs was defined as two or more binding sites in the same promoter, which can be TFBSs either from the same PWM or from two PWMs. For combinatorial TFBS enrichment, distance constraints were first applied for the selection of co-occurring TFBSs with a certain defined maximum between-TFBS distance, which is the largest number of nucleotides allowed between the end of the first TFBS and the start of the next TFBS. In the case of two PWMs, the distance constraints were applied to TFBSs from 2 different PWMs but not from an identical PWM, and the number of TFBS from each individual PWM may vary. To prevent double counting of TFBSs, two overlapping matches for the same PWM were considered as a single match and the one closer to the next co-occurring TFBS was counted. A total of 18 distances were defined, ranging from the smallest 10 bp to the largest 900 bp, for which 10 bp increments were used for those with distance constraints of less than 100 bp and 100 bp increments for those with distance constraints of more than 100 bp. The counts of co-occurring TFBSs from distance constraints with smaller between-TFBS distances were also included in those with larger between-TFBS distances. The occurrence of TFBSs without a distance constraint was also computed, but overlapping TFBSs were counted only once. All calculations were separately computed using in-house developed Perl scripts (Additional data file 6).
The log odds ratio (LOD) was used to measure TFBS enrichment, which is the frequency of co-occurrence for a particular TFBS in real promoter sequences with respect to random expectation for the co-occurrence of the same TFBS from one corresponding set of shuffled promoter sequences: Where f r (#TFBS/#promoters) is the TFBS frequency per gene from real promoter sequences, and f s (#TFBS/#promoters) is the frequency from shuffled promoter sequences.

Enrichment analyses of overlapping orthologous genes
For enrichment analyses of overlapping orthologous genes, genes whose promoter sequences contain at least two cooccurring TFBSs were obtained from both human and mouse. The p value for the between species enrichment of overlapping orthologous genes, which contained the same TFBS combinations from the same enforced distance constraint in both human and mouse, was computed using a hypergeometric distribution: Where S 1 and S 2 are the numbers of genes whose promoters possess the co-occurring TFBSs from either mouse or human, respectively; N is the total number of orthologous gene pairs used in this study (N = 10,046); and c is the number of common orthologous genes between S 1 and S 2 . The resulting p value is the chance probability of observing c or more common orthologous genes from two sets of size S 1 and S 2 drawn from a set of N gene pairs. The enrichment analyses of overlapping orthologous genes were performed for both original and shuffled promoter sequences. The latter, as described below, was obtained for the purpose of normalization.
For computing the correlation between the TFBS enrichment utilizing distance constraints and the overlapping orthologous gene enrichment, the p values of overlapping orthologous gene enrichment from original promoter sequences were first (-)log transformed and then normalized with the corresponding (-)log transformed p values from shuffled promoter sequences. To have a similar scale as LOD co , the LOD for overlapping orthologous gene enrichment was computed as follows:

Analyses of correlations between the TFBS enrichment and the overlapping orthologous gene enrichment
Pearson correlation coefficients were employed to estimate the correlation between the 19 LOD co and their corresponding LOD og scores for each TFBS combination, and permutation tests were employed to estimate the statistical significance of the correlation. To perform the permutation tests, the 19 LOD co and their corresponding LOD og scores from each TFBS combination were randomly paired and used for random correlation computation. The procedures were repeated 10,000 times to reach a random distribution and to give sufficient power for the estimation of p values. The resulting p values were used to set up a cutoff threshold (q-value < 0.05) for the selection of the most statistically significant correlation from multiple analyses.