Sequence signatures extracted from proximal promoters can be used to predict distal enhancers

Background Gene expression is controlled by proximal promoters and distal regulatory elements such as enhancers. While the activity of some promoters can be invariant across tissues, enhancers tend to be highly tissue-specific. Results We compiled sets of tissue-specific promoters based on gene expression profiles of 79 human tissues and cell types. Putative transcription factor binding sites within each set of sequences were used to train a support vector machine classifier capable of distinguishing tissue-specific promoters from control sequences. We obtained reliable classifiers for 92% of the tissues, with an area under the receiver operating characteristic curve between 60% (for subthalamic nucleus promoters) and 98% (for heart promoters). We next used these classifiers to identify tissue-specific enhancers, scanning distal non-coding sequences in the loci of the 200 most highly and lowly expressed genes. Thirty percent of reliable classifiers produced consistent enhancer predictions, with significantly higher densities in the loci of the most highly expressed compared to lowly expressed genes. Liver enhancer predictions were assessed in vivo using the hydrodynamic tail vein injection assay. Fifty-eight percent of the predictions yielded significant enhancer activity in the mouse liver, whereas a control set of five sequences was completely negative. Conclusions We conclude that promoters of tissue-specific genes often contain unambiguous tissue-specific signatures that can be learned and used for the de novo prediction of enhancers.


Supplementary Figures
. AUC of promoter-based classifiers (on the right) trained on specific tissues and tested on all tissues (at the top).
The diagonal corresponds to classifiers trained and tested on the same tissues. Only models for which we obtained a reliable classification ≥ . are depicted. For the heatmap, columns were clustered using Euclidean distance and average. Promoters shared between two tissues were excluded from the test set. Figure S2. AUC accuracy of promoter-based models evaluated using a 5-fold cross-validation as a function of the fraction of the promoters of the 200 most highly expressed genes in a given tissue that were used for training the models that overlap CpG islands. Only 10 outliers are labeled. Figure S3. AUC accuracy of promoter-based models evaluated using a 5-fold cross-validation as a function of the fraction of the promoters of the 200 most highly expressed genes in a given tissue that were used for training the models that contain TATA box motifs. Only 10 outliers are labeled. Figure S4. AUC accuracy of promoter-based models evaluated using a 5-fold cross-validation as a function of the average percent of sequence identity between human and mouse within a 50bp-window around the TSS of 200 most highly expressed genes in a given tissue. Only 10 outliers are labeled. Figure S5. Number of loci in which at least one of the scanned sequences was considered a tissue-specific enhancer prediction divided by the total number of loci to which we applied the classifier, for each of the 71 tissues for which we obtained reliable promoter-based models, and both loci of highly expressed (in red) and inhibited (in black) genes.

Figure S6. Boxplots showing median and interquartile range for the scores of enhancer predictions in loci of highly expressed (in red) and inhibited (in gray) genes for each of the 71 tissues for which we obtained reliable promoter-based models.
Figure S7. Receiver operating characteristic (ROC) curve evaluating the performance of the liver classifier at predicting conserved liver enhancers in loci of highly expressed genes. Predictions were compared with four different enhancer markers. A) ChIP-seq on p300 in adult mouse liver [1]. B) Enhancers predicted on the basis of chromatin modification marks and p300 [2]. C) Genome-wide DNase hypersensitivity analysis (DNase-Seq) in mouse liver [3] D) Strong enhancers predicted in HepG2 cells using chromHMM [4]. Coordinates in the mouse genome were mapped to the human genome using the LiftOver tool [5].

Promoter-based models are robust to changes in the number of promoters in the training set
To assess to which degree our models describing tissue-specific promoter activity could be exploited to discover enhancers, we applied each of the 73 reliable models trained on promoter regions to predict enhancers in the loci of genes that were among the 200 most highly or lowly expressed in the corresponding tissue (see Methods).
Although this cut-off of 200 promoters is admittedly somewhat arbitrary, our results are relatively robust to this

Sets of lowly expressed genes are tissue-specific
Depending on the tissues involved in the analysis, specific gene families with very restricted expression domains may be over-represented among sets of most lowly expressed genes. However, we did not observe a large overlap between the sets of most lowly expressed genes chosen for each tissue. Between any pair of tissues we found an overlap of 7%, as reflected in Figure II. Also clusters computed on pairwise overlaps reflect relationships between the tissues. For these clusters we found pairwise overlaps close to 40%.

Highly expressed genes in each of the 79 tissues considered exhibit distinct structural and genomic characteristics
In contrast to housekeeping genes, which are ubiquitously expressed and perform basic cellular functions, tissue-specific genes are selectively expressed at a usually much higher level in a few tissues only. Little is known about the genomic features and mechanisms responsible for both basal and tissue-specific expression. For this reason, we first examined the differences between the 200 most and least highly expressed genes in each of 79 human tissues whose transcriptomes are available in the GNF Gene Expression Atlas 2 [7]. We found that for most tissues, the median of the expression value of the most highly expressed genes in a given tissue is not only significantly higher than the expression of an average gene (log ratios 2.1 to 5.2, Figure III), but also with respect to the median of the expression value in the remaining tissues (log ratios 2.0 to 5.0, Figure IV), suggesting that most of these genes are expressed in a rather tissue-specific manner.
Sequence analysis reveals differences among highly expressed genes in different tissues. For example, genes most highly expressed in the nervous system are relatively well conserved, with over 70% of these genes preceding the tetrapod radiation, as compared to only 60% of genes highly expressed in testis, for example ( Figure V). Also, genes highly expressed in the nervous system are separated from other genes by relatively long stretches of noncoding sequence enriched with deeply conserved noncoding elements (CNEs). For example, genes highly expressed in the fetal brain and the adult brain are separated from their immediate neighboring genes by an average distance of 54 and 39 kb, respectively, while genes highly expressed in the heart are at an average distance of only 9 kb from their nearest neighbors ( Figure VI). In turn, intergenic regions flanking genes highly expressed in the nervous system contain a large number of CNEs, with the average of 3.0x10 -5 and 2.6x10 -5 human/zebrafish CNEs per kb per intergenic region for fetal and adult brain, respectively, as compared with the global average of 5.0x10 -6 for the human genome (P-values << 0.05, computed with the Wilcoxon Rank-Sum Test, Figure VII). Such evolutionary and structural peculiarities may underlie distinct mechanisms governing tissue-specific gene expression.
In summary, these results support the hypothesis that requirements for tissue-specific regulation impose selection constraints on coding and noncoding sequences, hence shaping gene structure, genomic distribution, and evolutionary conservation of tissue-specific genes and their regulatory regions. We speculate that tight tissue-specific control is achieved through specific interaction between enhancers and promoters, and thus, that particular signatures within tissue-specific promoters can be identified and used to predict active enhancers. Identifying such signatures would provide a computational, genome-wide means of identifying potential distant enhancers helping to prioritize costly experimental assays.

The promoters of highly expressed genes in each of the 79 tissues considered are well conserved
In general, the promoter regions of these genes are evolutionary conserved, particularly in close proximity to the TSS (with a median of 63% sequence identity between human and mouse around the TSS, as compared to 58% for an average human promoter, Figure VIII). However, the degree of conservation varies depending on the tissue where the corresponding genes are expressed. For example, genes most highly expressed in the brain are characterized by extremely conserved promoters (e.g., 74% identity as compared to mouse, in the case of cerebellum peduncles), while promoters of genes expressed in other tissues, such as testis, are less conserved (e.g., 54% identity for testis interstitial cells). Consistent with previous studies, we observed that promoter conservation is modestly but positively correlated with coding conservation (r-squared = 0.4, p-value = 2.2x10 -9 , computed using the F-test, [8]). In addition, most promoter regions of most highly expressed genes contain at least two conserved noncoding elements [CNEs, [9], data not shown], supporting the existence of multiple discrete regulatory units in close proximity to the TSS. binding sites. However, such differences are also likely to highlight contrasts in gene regulation across different tissues, both in terms of complexity, as well as differential usage of promoters and enhancers. Although CpG-rich promoters typically harbor relatively few overrepresented motifs [10], for promoters of most highly expressed genes we found that the number of TFs with significantly enriched binding sites does not strongly correlate with CpG content, and ranges between 94 for colorectal adenocarcinoma, to 0 in the case of BM-CD71+ early erythroid cells, leukemia promyelocytic (hl60), pituitary gland, testis germ cells, testis interstitial, and uterus corpus ( Figure IX). Our inability to identify relevant TFs may suggest that, for some tissues, tissue-specific regulation mainly depends on enhancers or other factors such as chromatin structure.
Notably, the known biological role of the TFs with enriched binding sites matched their respective tissues.
For instance, while 9 motifs enriched among promoters of genes expressed in fetal liver mainly correspond to binding sites for HNF1A [11], the 44 motifs enriched among promoters of genes expressed in the adult liver are consistent with the binding sites of a more extensive list of TFs, including HNF4A, PPARA, PPARG, NR5A2, and NR2F1, each known to play a major role in adult liver function [12][13][14][15][16]. Together, these observations are largely consistent with the established model in which ubiquitous regulatory activity is exerted by promoters [17][18][19], but also suggest that some promoters function in a tissue-specific manner, in agreement with recent evidence from next-generation sequencing projects [2].
We also detected differences in the distribution of enriched TF binding sites, which is not uniform along the promoter region. Enriched TF binding sites tend to co-localize towards the TSS ( Figure X), but, in agreement with observations in yeast (e.g., [20]), we observed differences between the architecture of TATA-containing and TATAless promoters. In TATA-containing promoters, such as those of most genes expressed in the peripheral nervous system (trigeminal, dorsal root, and superior cervical ganglion), TF binding sites tend to be more broadly distributed further upstream of the TSS, possibly suggesting differences in the recruitment of pre-initiation complexes and implying different modes of regulation for transcription initiation among different groups of promoters, and, therefore, tissues.

Differences in GC-content between promoters of highly expressed and inhibited genes contribute, but do not fully explain cross-validation performance of promoter-based models
Several hypotheses have been proposed to explain GC-content heterogeneity along the human genome, including a role for GC-content in chromatin organization and gene regulation [21,22]. We did not observe a strong correlation between differences in the GC-content of the promoters associated with highly expressed and inhibited genes (p-value > 0.05, Spearman Rank Correlation Test). Moreover, the differences in GC-content tend to be smaller for reliable models (log2 fold-change = -0.02), as compared with those for unreliable models (log2 fold-change = -0.05). Furthermore, the GC-content of the most predictive TF binding sites does not necessarily reflect the difference in GC-content between the promoters of highly expressed and inhibited genes in a given tissue. For example, in the case of the liver model, which yielded an AUC value of 0.96 ± 0.04, promoters of highly expressed genes have significantly higher GC-content as compared to those of inhibited genes (0.53 and 0.50, respectively, p-value = 9.8x10 -7 , Wilcoxon Rank-Sum Test), but there is no significant difference between the GC-content of the top and bottom 50 predictive TF binding sites (p-value > 0.5, Wilcoxon Rank-Sum Test). Also, we obtained an AUC value of 0.93 ± 0.05 after randomly permuting the nucleotide probabilities in the position-weight matrices (PWMs) which were used to identify TF binding sites (see Methods), which is significantly lower than the AUC value achieved by the original model (p-value = 0.01, Wilcoxon Rank-Sum Test). Taken together, these results demonstrate that differences in GC-content contribute but cannot fully explain the variation in performance of the models.

TF-based promoter-enhancer interactions
Our results suggest that TFs bound to both promoters and enhancers can at least partially explain the specificity observed in enhancer-promoter interactions. This mechanism may be of particular importance in complex genetic loci. To further investigate this relationship, we measured the number of putative enhancers associated with each promoter region. We found that for ~50% of tissues, the loci of genes with promoters exhibiting tissue-specific signatures harbor more enhancer predictions than those of genes with promoters with no tissue-specific signatures (pvalues ≤ 0.05, Wilcoxon Rank-Sum test, Figure XI; the number of enhancer predictions was normalized to account for variations in locus length). For the remaining tissues we did not observe any significant differences. We hypothesize that while the expression of genes with promoters containing weak tissue-specific signatures might be regulated mainly by epigenetic modifications, genes relying heavily on promoter-enhancer interactions would incorporate an additional layer of control in the fine-tuning of tissue-specific transcription. is computed as the number of enhancer predictions in loci of highly expressed genes divided by the total number of sequences scanned in loci of highly expressed genes. Promoter strength was assumed to be a function of the maximum score calculated for the CNEs in the promoter region with the corresponding tissue-specific model. While for 50% of the tissues considered loci of genes associated with stronger promoters tend to harbor more enhancers, for the remaining 50% loci of genes associated with stronger promoters tend to harbor less enhancers. The differences, however, are mainly significant for the former, which also display higher proportion of enhancer predictions. Significant differences in the number of enhancer predictions relative to the total number of sequences scored in the loci of genes with strong promoters (red) as compared to weak promoters (gray) is indicated with an asterisk.

Conservation of promoter-based enhancer predictions
To determine the generality of this trend and detect enhancers that are under no detectable evolutionary constraint [23][24][25], we extended our functional analysis to sequences that are not necessarily conserved in mouse. To this end, we applied the promoter-based models to the sequence of loci of highly expressed and inhibited genes using a sliding window approach. For the size of the window, we chose the average length of conserved region between human and mouse [9], namely 230 bp. The sliding window was shifted by 115 bp. Windows overlapping the sequence 2.5 kb upstream and 0.5 kb downstream of the nearest TSS according to the refGene.txt and knownGene.txt tables (available in the UCSC Genome Browser database, [26]) were excluded from further analysis. A given sequence was considered an enhancer prediction if its score was greater than = 0, , where is the lowest score of the top 5% sequences scored in the control loci.
As for predictions based on conserved noncoding sequences across the human and mouse genome, we evaluated the performance of the models by comparing the proportion of enhancer predictions and scanned sequences.
For 70 % of the tissues our enhancer predictions are strongly enriched in loci of highly expressed genes as compared to inhibited genes (p-values ≤ 0.05, Fisher's Exact Test, Figure XII). The three healthy tissues showing the most pronounced enrichment are also heart, lung, and liver, with fold differences of 25, 9, and 6, respectively. Furthermore, our enhancer predictions show extensive overlap with noncoding sequences associated with enhancer activity, such as DHS sites (39 %) and H3K4me1 histone marks (55 %) in various ENCODE cell lines (with fold enrichments 1.3 and 1.2, respectively, p-values < 0.001, computed based on 1000 randomized sequences genome-wide), suggesting that the promoter-based models are also able to predict distal enhancers genome-wide, independent from sequence conservation. However, and although the level of conservation of these enhancer predictions varied depending on the tissue, we found that most predictions were well conserved in mammals (with 37 to 71% of predictions, depending on the tissue, overlapping CNEs, see Figure XIII). Consistent with previous results, we observed that whereas 11 % of predictions for enhancers active in the brain were conserved beyond mammals and birds, only 4 % of predictions for enhancers active in the skin were as deeply conserved. Indeed, p300-binding regions have been shown to exhibit a variable degree of evolutionary conservation across different tissues [25]. Also, for 80% of tissues, including brain, heart and liver, we observed that most conserved mammalian noncoding sequences [27] are significantly enriched in our predictions, as compared to the random expectation (p-value ≤ 0.05, calculated using the Wilcoxon Rank-Sum Test, Figure XIV), substantiating the focus on evolutionary conserved regions in genome-wide functional analysis.