- Open Access
A simple metric of promoter architecture robustly predicts expression breadth of human genes suggesting that most transcription factors are positive regulators
© Huminiecki et al.; licensee BioMed Central ltd 2014
Received: 10 December 2013
Accepted: 15 July 2014
Published: 31 July 2014
Conventional wisdom holds that, owing to the dominance of features such as chromatin level control, the expression of a gene cannot be readily predicted from knowledge of promoter architecture. This is reflected, for example, in a weak or absent correlation between promoter divergence and expression divergence between paralogs. However, an inability to predict may reflect an inability to accurately measure or employment of the wrong parameters. Here we address this issue through integration of two exceptional resources: ENCODE data on transcription factor binding and the FANTOM5 high-resolution expression atlas.
Consistent with the notion that in eukaryotes most transcription factors are activating, the number of transcription factors binding a promoter is a strong predictor of expression breadth. In addition, evolutionarily young duplicates have fewer transcription factor binders and narrower expression. Nonetheless, we find several binders and cooperative sets that are disproportionately associated with broad expression, indicating that models more complex than simple correlations should hold more predictive power. Indeed, a machine learning approach improves fit to the data compared with a simple correlation. Machine learning could at best moderately predict tissue of expression of tissue specific genes.
We find robust evidence that some expression parameters and paralog expression divergence are strongly predictable with knowledge of transcription factor binding repertoire. While some cooperative complexes can be identified, consistent with the notion that most eukaryotic transcription factors are activating, a simple predictor, the number of binding transcription factors found on a promoter, is a robust predictor of expression breadth.
Is it possible to predict expression parameters of a gene from knowledge of the promoter architecture of that gene? If, for example, we knew the transcription factors (TF) that bind the promoter of a gene, can we predict the breadth of expression (BoE, i.e. the proportion of tissues/cells within which the gene is expressed) or the mean level of expression of that gene? It is known that expression patterns of gene duplicates diverge over evolutionary time [,], but can we predict how different the expression of paralogs will be knowing nothing more than their promoter architecture? What in turn is the relationship between expression breadth and the number of TFs regulating a gene (TfbsNo.)? Given that, in contrast to prokaryotes, the ground state for most eukaryotic genes is inactivity [], we might expect that broadly expressed genes should have very many regulating TFs, assuming eukaryotic TFs are for the most part activating []. However, some very broadly expressed genes might have reverted to a more prokaryotic state and have activity as the constitutive state and hence not require TF activation. Alternatively, the breadth of expression may be conferred by the ability to bind a few specialist transcription factors or through cooperation of particular TFs, in which case the total number of binders need not predict breadth.
At first sight the answer to many of these questions may appear rather trivial: surely if we know the TFs that bind a gene’s promoter and know when those TFs are present in cells then we must know the expression parameters of a gene []? However, an in-depth study of STE12 found that expression changes in response to this transcription factor accounted for only half the observed expression fluctuations []. That the coupling between TF presence/absence need not be such an excellent predictor is indicative of other levels of control. In addition to transcription level regulation (presence/absence of the relevant TFs), genes can be regulated both pre- and post-transcriptionally. Post-transcriptionally, processes such as nonsense-mediated decay (NMD) [], microRNA level regulation [] and modulation of RNA stability [], can also act to reduce the transcript levels below that expected given the transcription rate, potentially buffering larger changes in mRNA levels. Chromatin level pre-transcriptional regulation may be the dominant factor []. This can mean either higher-level chromatin architecture (open/closed chromatin configuration) [] or other epigenetic marks (histone modification, methylation etc.) [,], all of which can modulate the expression of the gene even if the relevant TFs are present.
Much evidence supports a strong role for chromatin in dictating expression profiles. For example, insertion of the same transgene into different regions in the genome leads to different expression levels dependent on the expression profile of the neighbouring genes []. Similarly, a pair of transgenes can be co-expressed if introduced in tandem (so sharing the same chromatin environment) but have uncoordinated expression when introduced into unlinked locations []. Upregulation of one gene is similarly thought to cause a time-lagged ripple of chromatin opening which leads to spikes in the expression of neighbours []. More generally, at least in yeast, physical proximity of genes, is a strong predictor of the degree of co-expression between any two genes []. Indeed, for unlinked genes, on average two genes with the identical repertoire of TF binders, have only a weak degree of co-expression (r 2 ~ 1-2%), much less than the degree of co-expression of two linked genes with no transcription factors in common (r 2 ~ 10%) []. Moreover, DNA methylation was found to increase or decrease BoE depending on the target sequence []; while CpG islands colocalize with most promoters and are characterized by low methylation []. These results all suggest that chromatin level effects are not negligible and that extrapolation from TF binding to expression profile might be a relatively futile enterprise. In contrast to this position, however, is a striking counter-example demonstrating that the expression profile of genes involved in Drosophila segmentation is well predicted by the knowledge of TF binding sites and TF levels [].
One approach to determine the extent to which promoter architecture determines expression parameters has been to consider the relationship between expression divergence and promoter divergence between paralogs within a genome or between orthologs in different genomes [–]. The logic is the same in both instances, namely that if the differences from the ancestral expression profile to current expression profile have been owing to changes in the sequence of the promoters, then comparing multiple genes across genomes (for orthologs) or within genomes (for paralogs) should reveal correlations between the degree of promoter divergence and the degree of expression divergence. In the instance of paralogs there is an additional assumption that the duplicate versions of the same gene were generated in a manner that preserved the promoters. These analyses commonly suggest little or no coupling between promoter divergence and expression divergence, consistent with a weak coupling between promoter architecture and gene expression parameters. For example, within yeasts divergence of transcription factor binding sites (Tfbs) has little impact on expression divergence between orthologs []. Similarly, Park and Makova found in humans that the correspondence of paralog cis-regulatory regions was so weakly correlated with expression divergence in a multiple regression that it wasn’t significant after multi-test correction []. A further yeast study found that promoter divergence explained only 2-3% of expression variability []. These results suggest that cis-regulatory effects are not a major influence on expression profile. By contrast, a promoter screen in yeast found evidence for a robust correlation between the number of shared motifs and the degree of expression divergence between paralogs [], although, unexpectedly, the absolute number of motifs the paralogs have is approximately constant over time. Clearly, more analysis is needed to investigate this key question in the field of expression pattern evolution.
The numbers of samples in distinct FANTOM5 categories
Cancer cell lines*
Brain tissues a
Reproductive tissues b
Here then we employ this novel data to ask whether expression profiles can be predicted from promoter architecture. In the first instance we wish to know whether the total number of transcription factors binding a promoter is a good predictor. We follow this up with the analysis of interactants and a more complex machine learning approach. We start by resolving basic parameters of TF binding and promoter architecture.
The number of transcription factors per gene follows a power law
The distribution parameters for the number of transcription factor binding sites mapping to proximal promoters depending on the promoter window size and ENCODE quality cutoff
The percentages of genes with 1 TF, 2 TFs, and up to 5 TFs depending on the promoter window size and ENCODE quality cutoff
up to 5 TFs
Effective promoter size is about 6kb (±3000 bps from the TSS)
Broadly expressed genes have more transcription factor binding sites
Correlations and partial correlations
As regards possible chromatin effects we observe (Table 4 and Figure 4), as expected, a positive correlation between BoE and ENCODE DNASE1 signal (Spearman’s rho = 0.19, P-value < 2.2e-16), and a negative correlation between BoE and ENCODE methylation signal (Spearman’s rho = −0.11, P-value < 2.2e-16). There was also a strong correlation of BoE with GC- and CpG-content (rho = 0.33, P-value < 2.2e-16; and rho = 0.42, P-value < 2.2e-16, respectively). There was also a strong correlation between CpG and TfbsNo. (rho = 0.45, P-value < 2.2e-16, Figure 4) and GC-content and the number of Tfb sites (rho = 0.29, P-value < 2.2e-16, Figure 4). Strikingly, however, on multiway partial correlation, the strength of these effects tended to diminish dramatically. Correlation with GC went from a raw correlation of 0.33 to a partial of just 0.03. Correlation with DNASE1 went from 0.19 to just 0.06 and the methyl effect diminished from −0.11 to just −0.04. By contrast the effect of transcription factor number was relatively unchanged (0.48 prior to multiway analysis, 0.4 after). These results suggest that the chromatin effects may mediate the control of gene expression, but the prediction of breadth of expression is best done via transcription factor information (and this is most likely the casual association).
Closer scrutiny of the impact of GC-content as a predictor supports the view that it is GC of the core promoter rather than a more regionalized GC-content that impacts breadth of expression. When we divided promoters into low-GC (less than 50%, n = 5650) and high-GC (more or equal than 50%, n = 25710), the second group had more than three times higher average BoE (the exact ratio was 0.3379/0.099 = 3.41) and on average bound more than four times more TFs (9.55/2.29 = 4.17). The GC-content of proximal promoters (defined as a 1 kbps window) was much higher than that of surrounding DNA sequences (20 kbps window): 0.594 versus 0.463 (Welch Two Sample t-test, P-value < 2.2e-16). Similar results were reported by the ENCODE consortium who found that GC-content of ChIP-seq sites was 61 ± 5% for TSS-proximal peaks []. The exact cause of this effect is not yet fully understood. Although some TF motifs are GC-rich [], these are usually much smaller that the actual ChIP-seq peaks (8–21 bps vs. ∼250 bps).
SVM trained with data on the numbers of interacting Tfbs (SVM-Tfbs) improved on simple correlation, but adding data on GC-content (SVM-Tfbs + GC) did not lead to further improvement of predictions
SVM-Tfbs + GC
We note that we see little or no evidence for a class of genes so highly broadly expressed that they dispense with TFs all-together. In fact, there were only thirty-nine broadly expressed genes with fewer than ten high-quality TFs in a broad 10 kb window around the TSS (Additional file 2: Table S1).
Expression level is not well predicted by Tfbs number
The number of transcription factor binding sites correlated with the breadth of expression, the mean expression, and the median expression, but not with the value of the maximum expression of a transcript
The strength of correlation Tfbs No.
Breadth at the cutoff of 10 TPM
r p = 0.448
t = 88.1194
df = 30873
Breadth at the cutoff of 100 TPM
r p = 0.16
t = 28.6497
df = 30873
Breadth at the cutoff of 1,000 TPM
r p = 0.035
t = 6.1749
df = 30873
r p = 0.13
t = 23.3451
df = 30873
r p = 0.254
t = 46.2675
df = 30873
r p = −0.02
t = −3.5161
df = 30873
r p = −0.041
t = −6.4983
df = 25040
r p = −0.026
t = −4.1194
df = 25040
There is evidence for a real correlation between expression breadth and expression levels if non-parametric statistics are used
Mean vs. breadth (Pearson correlation)parametric statistics
Mean-conditioned-by-breadth vs. breadth (Pearson correlation)parametric statistics
Mean vs. breadth (Spearman correlation)nonparametric statistics
Mean-conditioned-by-breadth vs. breadth (Spearman correlation)nonparametric statistics
r p = 0.33, p < 2.2e-16, T
r p = −0.012, p = 0.0546, T
rho = 0.94, p < 2.2e-16, T
rho = 0.45, p < 2.2e-16, T
r p = 0.26, p < 2.2e-16, PC
r p = 0.03, p = 4.486e-09, PC
rho = 0.95, p < 2.2e-16, PC
rho = 0.5, p < 2.2e-16, PC
r p = 0.34, p < 2.2e-16, CCL
r p = 0.12, p < 2.2e-16, CCL
rho = 0.957, p < 2.2e-16, CCL
rho = 0.44, p < 2.2e-16, CCL
r p = 0.64, p < 2.2e-16, T
r p = −0.00015, p = 0.9892, T
rho = 0.6, p < 2.2e-16, T
rho = 0.41, p < 2.2e-16, T
r p = 0.52, p < 2.2e-16, PC
r p = 0.07, p = 4.264e-12, PC
rho = 0.66, p < 2.2e-16, PC
rho = 0.49, p < 2.2e-16, PC
r p = 0.66, p < 2.2e-16, CCL
r p = 0.22, p < 2.2e-16, CCL
rho = 0.66, p < 2.2e-16, CCL
rho = 0.38, p < 2.2e-16, CCL
r p = 0.76, p < 2.2e-16, T
r p = −0.027, p = 0.4422, T
rho = 0.24, p < 2.2e-16, T
rho = 0.32, p < 2.2e-16, T
r p = 0.82, p < 2.2e-16, PC
r p = 0.021, p = 0.4608, PC
rho = 0.28, p < 2.2e-16, PC
rho = 0.47, p < 2.2e-16, PC
r p = 0.88, p < 2.2e-16, CCL
r p = 0.12, p = 0.00016, CCL
rho = 0.25, p < 2.2e-16, CCL
rho = 0.34, p < 2.2e-16, CCL
The correlation between TF binding sites and expression breadth is robust
The above results strongly support the view that more TF binding is correlated with expression in more tissues. How robust is this result? Is it true in both normal and diseased states? Is it robust to control for whether or not RNA PolII is included in the set of binders? Is it dependent on the assumed size of the promoter? In the three sections below we consider these and other possible confounders.
Correlations are robust to alternative assumptions of the promoter size
Given the decay in the rate of the increase of the number of TFs as the size of promoters expands, we presume that increasing the assumed promoter size should start to cause a decay in the correlation between expression and the number of TFs, just because we are diluting signal (true TF binding) with noise (spurious or unassociated binding). As expected (Figure 2c), the correlation between the breadth of expression and the number of transcription factor binding sites, although robust to the variation in window size, decreases as the size of the analysis window increases. A converse interpretation of this is that Tfbs controlling the breadth of expression are enriched close to the transcription start sites.
The inter-relationship, while strongest for small window sizes, persisted for windows up to 40kb in size (Figure 2c), well beyond the 6kb limit of effective promoter size. We expect this limit to be greater than that derived from the rate of increase of TFs measure (circa 3kb ± TSS) as it takes a considerable dilution of the signal of the TF loaded TSS to remove any correlation. The trend was similar when the ENCODE meta data set from the 2011 freeze [] was compared with the broader 2012 freeze []. Both these data sets were comprehensive in their coverage of transcription factors (148 and 161, respectively). Both data freezes also covered a wide sample space: the earlier freeze with seventy-one cell-types and twenty-four additional experimental cell culture conditions [], and the later freeze with ninety-one human cell types with various treatments []. Finally, the trends detected were robust to alterations in the quality cutoff for ENCODE transcription factor binding sites (Figure 2a-b-c).
The correlations are stronger across cell lines than across gross tissues
To control for the possibility of a sample bias or differences between normal and diseased tissues, we tested whether the correlation between the breadth of expression and the number of transcription factor binding sites held across the entire FANTOM5 sample space. Figure 5a and Figure 6 show the results for human tissues. We confirmed that the trends are also seen for primary cells (Figure 5b and Additional file 3: Figure S1) and cancer cell lines (Figure 5c and Additional file 4: Figure S2). The Pearson correlation coefficient (r p ) between the breadth of expression and the number of transcription factor binding sites equaled 0.53 for primary cells, and 0.61 for cancer cell lines. The correlation is strikingly stronger for cell lines than for tissues. It makes sense that the correlation was stronger for primary cells and cell lines (r 2 ~ 28-37%) than for tissues (r 2 ~ 20%), as tissues are complex mixtures of cell types where some of the cell-type-specific signal might have been lost.
The correlation between the breadth of expression and the number of transcription factor binding sites holds when RNA polymerase II sites are excluded
Above we considered all bindings at promoter regions of genes, including RNA polymerase II binding sites. One might readily object that if one includes PolII binding then highly expressed genes may well have more bindings, if only because they have more PolII. Might then the correlation between the breadth of expression and the number of transcription factor binding sites be driven by RNA polymerase II binding sites, or biased by another abundant transcription factor?
To investigate this we employed four approaches for counting, these being: (1) the total number of binding sites, (2) the number of unique binding sites, (3) the total number of binding sites excluding RNA polymerase II, and (4) the number of unique binding sites excluding the polymerase. We find that the correlation holds regardless of the method (see Figure 6). Indeed, results using these four measures were largely indistinguishable. For example, when polymerase sites were excluded, the correlation between the number of transcription factor binding sites and the breadth of expression in human tissues was 0.434 (t = 84.8393, df = 31093, P-value < 2.2e-16). The correlation was 0.435 when additionally only unique sites were counted (t = 85.1458, df = 31093, P-value < 2.2e-16). In comparison, the original correlation including all sites was 0.448 (t = 88.2645, df = 31093, P-value < 2.2e-16), and when only unique sites were counted 0.45 (t = 88.6194, df = 31093, P-value < 2.2e-16). Indeed, the three derived measures correlated very highly with the original measure with correlation coefficients in pairwise comparisons of 0.994, 0.997, and 0.992 for unique sites, no PolII sites, and unique sites excluding PolII (all P-value < 2.2e-16), respectively.
We can also turn the data the other way around and ask whether sites with more TF bindings also have more PolII. Such a correlation would provide sound evidence that more TFs do indeed result in more transcription. We find this to be the case. After excluding its own sites, the polymerase signal correlated strongly with the total number of transcription factor binding sites (r p = 0.75). The correlation between the breadth of expression and the number of transcription factor binding sites persisted after controlling for the polymerase signal (partial Spearman’s correlation coefficient equaled 0.3).
The divergence of the promoters of paralogs strongly predicts divergence of their expression patterns
As discussed in the introduction, a common method to approach the problem of the degree of promoter-centred control of gene expression has been to ask about the similarity in gene expression of paralogs as a function of the similarity in their promoter domains. In addition to the correlation between the breadth of expression and the number of transcription factor binding sites, we found that the divergence of proximal promoters (measured via a Jaccard Index on Tfbs repertoire – see Methods) correlated strongly with expression divergence, measured by Pearson’s R. The r p for this trend equaled 0.282 when only the youngest paralog pairs were taken into account. The r p was even higher when all daughter pairs were taken into account 0.54 (t = 239.8391, df = 136608, P-value < 2.2e-16). However, the latter comparisons were not fully independent and the results might have been biased by large gene families with a high number of pairwise comparisons. For example, the core histones of H2A@, H2B@, and H3@ families underwent dramatic expansions in placental mammals, resulting in paralogs which are highly co-expressed in proliferating tissues such as the thymus and the testis (manuscript in preparation).
The correlation between the Jaccard index ( JI ) and paralog co-expression was robust in respect to the breadth of expression
The breadth of expression of paralogs
Pearson’s correlation betweenJIand paralog co-expression
Both paralogs were tissue-specific
Both paralogs were intermediate
Both paralogs were housekeeping
One gene tissue-specific, the other housekeeping
Young duplicates were more tissue-specific
Young duplicates had fewer Tfbs regulators
To investigate the origin of new tissue-specific genes, we divided duplication events into three subclasses: “housekeeping conserved” (both paralogs were housekeeping), “tissue − sp. conserved” (both paralogs were tissue-specific), and “transformative” (where one daughter gene was housekeeping while the other was tissue-specific). The relative proportion of “tissue − sp. conserved” events increases for younger taxa indicating that this class of duplication events is responsible for the majority of the increase of tissue-specificity observed for young taxa (Figure 9c and d). This accords with a model suggesting that successful duplication events tend to be those with minimal impact [,]. It also accords with the finding that tissue-specific genes are more likely to belong to large gene families []. The coupling between duplication age and breadth may bias some statistics. If the breadth of expression of a gene in any manner predicts divergence in expression, this bias has the potential to mislead any analysis that considers the degree of divergence between promoters and divergence in expression, as the least diverged duplicates (i.e. the youngest duplicates) will be systematically biased towards the tissue-specific end of the spectrum. However, the trend for gradual expression divergence of paralogs was described in both multicellular [,], and unicellular organisms [] where tissue-specificity cannot be an issue.
Broad expression is associated with specific transcription factors or groups of cooperating factors
First, the breadth of expression was merged into one matrix with the number of ENCODE transcription factor binding sites. Next, a heatmap was drawn for this matrix in order to determine which transcription factors correlated closest with the breadth of expression. i.e. which transcription factors acted as molecular switches for house-keeping expression (Figure 10). The heatmap in Figure 10 uses Pearson’s correlation as the distance measure. Similar results were obtained for human tissues with both the Kendall rank correlation coefficient and Spearman's rank correlation coefficient (Additional file 7: Figure S5 and Additional file 8: Figure S6). We also investigated distance-based correlations such as Euclidean, Manhattan, or Minkovski distances but these measures did not recover any non-trivial clustering.
As expected, the breadth of expression clustered with RNA polymerase II and the transcription initiation factor TFIID — TAF1. This was perhaps unsurprising as the polymerase and TAF1 are constitutive components of the transcription apparatus. More interestingly, in human tissues, the breadth of expression also clustered closely with Mxi1, YY1, NFKB, HEY1, Sin3A, and c − Myc (this cluster was marked with an A in Figure 10) suggesting these transcription factors were among control switches. In human primary cells, the breadth of expression also clustered with the polymerase, TFIID, NFKB, HEY1, Sin3A, and c − Myc (Additional file 10: Figure S3). However, in cancer cell lines, the breadth of expression only clustered with RNA polymerase II and TFIID (Additional file 11: Figure S4) suggesting that cancerous transformation interferes with the majority, except the most rudimentary, control switches for the breadth of expression.
Other transcription factors formed two clusters with either low or high distance to the breadth of expression (these clusters were marked with a B and a C in Figure 10). These two clusters could be enriched in either housekeeping transcription factors (B), or tissue-specific transcription factors (C). In cancer cell lines (Additional file 11: Figure S4), clusters with low (B) and high (C) distance to the breadth of expression could be enriched in oncogenic transcription factors, and anti-oncogenic or tumor-specific transcription factors, respectively.
Many clusters of co-interacting transcription factors can be inferred from Figure 10. These clusters were marked with lowercase letters: (a) Nanog and Pou5f1; (b) Srebp1 and Srebp2; (c) STAT1-3; (d) mef2a and mef2b; (e) GATA-1 and GATA-2; (f) MafF and MafK; (g) BAF170 and BAF155; (h) AP-2 alpha and gamma; (i) FOS and FOSL2; (j) Jun and JunD; (k) FOXA1 and FOXA2; (l) HNF4A and HNF4G; (m) CTCF targeted by three different antibodies; (n) Rad21, SMC3 and CTCFL; (o) Pol3, BRF1, RPC155, and BDP1; (p) E2F1, E2F4, and E2F6; (q) SIX5, Znf143 and ETS1; (r) ELK4 and ELF1; (s) PAX5 targeted by two different antibodies; (t) ZBTB33, BRCA1 and CHD2; (u) NELFe, GTF2B, and TAF7; (v) POU2F2 and Oct-2; (w) NF-YB, NF-YA, c-Fos and SP1; (x) USF1 and USF2; (y) PolII and TAF1. Some apparent clusters are the same transcription factors targeted by different antibodies (for example, the CTCF rabbit polyclonal, the CTCF_C − 20 goat polyclonal, and the CTCF_SC − 5916 goat polyclonal antibody) and so should be disregarded. By contrast, some of the clusters are known cooperative complexes. For example, Rad21 and SMC3 form the cohesin complex. The cooperation between Nanog and Pouf1 (alias Oct4) is well described []. Other clusters suggest entirely new molecular interactions which should provide material for experimental verification.
A support vector machine (SVM) method predicts the breadth of expression better than the correlation alone
Given the evidence for cooperativity between TFs in their association with breadth of expression and for key control TFs, we would expect that more sophisticated statistical tools should improve the predictive ability of a model relating TF binding to expression breadth. To address this we established a machine learning approach. Our intention here is not to produce a better statistical model by incorporating non-causal (e.g. rate of protein evolution) and causal (TF binding) predictors of expression breadth. Rather we simply wish to ask whether incorporation of the likely causal factors in a more sophisticated statistical framework permits better understanding of the control of expression breadth.
The support vector machine is a commonly used machine learning approach to prediction. We trained the support vector machine using a randomly chosen half of the dataset. Each row of the basic SVM training data-frame was a vector representing the number of Tfbs of each gene; although, we also considered SVMs trained using promoter GC- and CpG-contents (Table 5). The resulting SVM model was then applied to predict the value of the breadth of expression for the other half of the dataset. Note that in this mode of operation with continuous data being predicted, an SVM is best regarded as a form of non-linear regression (rather than a discrete classifier) and hence the appropriate metric for considering its ability is the correlation between the observed and predicted variable (rather than AUC, accuracy, etc.). Such a correlation-based appraisal also permits direct comparison with the simple correlation approach examined above.
As noted above, in cancer cell lines, the correlation between the breadth of expression and the number of transcription factor bindings sites is 0.61. The SVM’s prediction improved on this and correlated with the observed value of the breadth of expression with r p = 0.7460 (Table 5). Results described herein were obtained using the promoter window of 1000 base pairs but essentially identical values were obtained using the promoter window of 4000 base pairs (data not shown). As negative and positive controls we used a teaching dataset where the response variable was scrambled or included as one of the training features, respectively. The SVM is thus capable of explaining 58% of the variation, as opposed to the correlation method’s 37%. The prediction accuracy was not due to polymerase signal alone since the SVM performed equally well when TAF1 and all types of polymerase II sites were removed (r p = 0.759). Moreover, the predictor did not simply rely on summing up of Tfbs, since an SVM trained using only the sums of Tfbs as features (Tfbs_x_length, Tfbs_x_length_unique, Tfbs_x_length_noPol2, Tfbs_x_length_unique_noPol2) did not improve on the simple correlation (r p = 0.646, r 2 = 0.42). The number of support vectors was high (approximately 60% of the training cases) suggesting that no simple discriminatory features could be found. Taken together, these results underlined a cooperative effect of many Tfbs acting together and clustering in a narrow window around the TSS to control the breadth of expression.
Are promoters in open chromatin “sticky”?
Even though specific interactions can be identified, that a simple correlation approach can capture so much is striking. The fact that a correlation approach works is consistent with the notion that most transcription factors in eukaryotes are activators. There is, however, an alternative interpretation of the correlation between number of TFs and increasing expression, this being that it reflects more a passive process of spurious TF binding. A simple model in which open/transcribed chromatin is to some degree “sticky” could in principle predict the same correlation. At the limit there might be a single TF that forces broad expression, but because of its presence other TFs are recruited, not because they are needed, but because TFs might be attracted to open chromatin and transcriptional hotspots as iron filings are attracted to a magnet. In this context, were for example GC rich sequence “sticky” for transcription factors, this might explain the GC-TF correlations. However, such a “sticky” model would require strong binding of the TFs to the DNA, rather than weak and short-lived non-specific interactions which are most unlikely to resist the processing of the TF-DNA interaction in ChIP-seq methodology.
Several facts argue against the “sticky” model. First, our approach has recovered known interacting complexes, such as cohesin and CTCF [,], suggesting that much of the signal is owing to functional rather than spurious effects. Furthermore the correlation between breadth and number of transcription factors is more profound for transcripts with fewer than twenty transcription factor binding sites (r p = 0.42), than it is for transcripts with more than twenty (r p = 0.098). We would expect the opposite if Tfbs simply accumulated in a runaway positive feedback loop.
A further argument against the “sticky” model of Tfbs binding comes from the examination of the breadth of expression of ENCODE Tfbs and their targets. We detected a strong correlation between the average breadth of expression of Tfbs targets and the breadth of expression of respective regulating Tfbs (Additional file 5: Table S3, Additional file 12: Figure S8) with the Spearman rho = 0.3176 (P-value = 0.0001242). Tissue-specific Tfbs bind on average more tissue-specific genes (mean BoE of targets 0.394), than housekeeping Tfbs (mean BoE = 0.468), Wilcoxon rank sum test P-value = 6.805e-05. Under the pure “sticky” model, we would expect no correlation. The picture is not clear-cut, however. Clearly, tissue-specific transcription factors can have targets in promoters of housekeeping genes. It is possible that housekeeping expression in certain tissue-types demands activation by this tissues unique transcription factors in addition to the standard set of TFs mediating housekeeping expression. Finally, TFs may differ in the degree of “stickiness”. Some TFs may be “sticky” towards open chromatin, while others bind very specifically, perhaps playing a key role in the initial act of the opening of the chromatin.
Another argument against the “sticky” model is that open/close chromatin is only a binary signal which cannot account for all the complexities of spatiotemporal gene regulation in Vertebrates. Moreover, the correlation between the breadth of expression and DNASE1 sensitivity, an excellent marker of open chromatin, was much weaker than that between the breadth of expression and the number of potential interacting transcription factors (Table 4).
Yet another argument against the sticky model is that TfbsNo. describes the general affinity landscape of promoters for Tfbs binding (i.e. the ability to bind TFs). Which TFs are actually bound will depend on a particular tissue and may vary greatly. This is because ENCODE inputs were derived datasets, where individual peaks from different tissues were merged if tagged to the same transcription factor binding site. In fact, counting Tfb sites across different tissues would be a source of a logical error, as a circular association between the breadth of expression and the presence of Tfb sites in multiple tissues would obscure any causal connections.
While multiple lines of evidence argue against the “sticky” chromatin model as the best explanation for the correlation between breadth of expression and number of TF binding sites we cannot entirely refute the hypothesis. Indeed scrutiny of the number of TFs binding the very most broadly expressed genes (right most column in Figure 5e) suggests that these have slightly more TFs than expected given the numbers in the prior bins. There also remains the possibility that the TFs that drive broad expression are themselves “sticky” and attract more TFs. Such a model is more ad hoc than the “sticky” chromatin model, having to make the extra assumption that only TFs associated with broad expression are “sticky”, for which we see no a priori defense. Moreover, this model can’t simply explain some of the above results, such as the stronger correlation for the less broadly expressed genes.
Prediction of the tissue of expression is moderate at best
Above we have asked if we can predict breadth from knowledge of TFBS. A possibly harder question, is whether contained within the promoter architecture is evidence of which tissue or tissues a narrowly expressed gene is expressed in (rather than just the narrowness of that expression). In principle a machine learning approach might be devised to employ the data that we have assembled to tackling such an issue. We took a similar approach to the SVM given above to predict tissues-specific expression (i.e., breadth of expression). The main difference was that the continuous response variable was the preferential expression measure (PEM) in a given tissue instead of the breadth of expression.
We find that SVMs trained using a matrix of TF frequencies can at best modestly predict preferential expression for some tissues (such as the brain and the adipose tissue, the liver, the lung, and the breast) with the correlation between the predicted and the observed expression of up to r p = 0.36 for brain (Figure 13) (note that as before the SVM is trained against continuous data so the appropriate metric of accuracy is a correlation coefficient). That brain was the best predicted may well reflect the fact that brain is also the tissue with the highest number of preferentially-expressed genes. Indeed, we see an overall correlation between predictive ability (correlation strength) and the mean degree of preferential expression (PEM avg ) for genes expressed in any given tissue (rho = 0.54). That is to say, the tissues for which we fail to predict the preferential expression of transcripts, are the tissues with few preferentially expressed transcripts (e.g. thyroid, salivary gland, skin, bone marrow). Overall, we conclude that at best we have only a moderate ability to predict degree of preferential expression of genes in any given tissue and that this diminishes greatly when the tissues themselves have few tissue specific genes.
An alternative binary classifier for the prediction of tissue-specific expression
An alternative approach to prediction of tissue-specific (i.e. narrow) expression is to divide the transcripts into two categories: tissue-specific and not tissue-specific, at the cutoff of the breadth of expression of 0.33 (i.e., one third of tissues). The approach taken here is identical to the one described above (see: A support vector machine method predicts the breadth of expression better than the correlation alone) except that the input and output measures of the breadth of expression were discretized. Standard validation charts for such a SVM-based binary classifier are shown in Additional file 13: Figure S9. The area under the ROC curve equaled 0.816. Inclusion of GC content does not improve this SVM’s predictive ability, confirming the irrelevance of this feature.
The results above support the view that, to a considerable degree, components of the expression profile, notably expression breadth, and in turn expression breadth divergence, can be predicted from knowledge of the TF binders of a given gene. That such a result wasn’t, for the most part, captured previously, may be explained as limitations in the prior data. That we can recover the breadth of expression result using the previously available Gene Expression Atlas [], suggests that it is the high resolution ENCODE transcription factor binding data that is key. Given that the “sticky” chromatin model fails to make a parsimonious explanation of the data, we conclude that the data support the suggestion that most eukaryotic TFs are activating.
Here, while touching on expression level, we have concentrated on expression breadth and divergence in breadth. Our results, however, suggest a series of further questions. How, for example, are we to interpret the evidence for cooperation between partners not previously known to be cooperative? Assuming antibody cross-reactivity is not the explanation, these statistically associated TFs suggest experimental tests for apparent cooperativity. Our analysis of whether it is possible to predict the tissue of expression of tissue specific genes suggests that a support vector approach has limited success at best. Given that we have the best success for the tissue (brain) with the most data, it may yet prove to be the case that with better techniques and more data this problem may become more tractable.
A further open issue is whether we can extrapolate our results to non-human species. An ability to infer mutational changes in promoters that are likely to have a major impact on expression, in poorly studied close relatives would be of considerable value in determining those promoters and TFs that might be (or have been) under selection in humans to switch gene expression “on” or “off” in tissues key to human uniqueness. Given the above result (or the moderate ability to predict brain specific genes) there may be some prospect of identifying some lineage specific changes that might have affected human brains.
We present evidence that expression breadth and paralog expression divergence are strongly predictable with knowledge of transcription factor binding in the proximal promoter. A simple metric, the number of binding transcription factors found on a promoter, is a robust predictor of expression breadth in human tissues. However, prediction of the tissue of expression is moderate at best.
(T) human.tissue.hCAGE.hg19.tpm.refgene.osc.txt - human tissues;
(PC) human.primary_cell.hCAGE.hg19.tpm.refgene.osc.txt - human primary cells;
(CCL) human.cell_line.hCAGE.hg19.tpm.refgene.osc.txt - human cancer cell lines.
These were tab-delimited text files with the header section describing columns, library names, and the total number of reads for each library. These files were pre-processed using standard methods to facilitate their import to the R statistical environment with the read.table command with row-names being RefSeq accession numbers and column-names identifying tissue, cell-type, or cell line from which CAGE tags were sequenced.
ChIP-seq data were described in the section below (Assembling TF binding data to define promoter architecture). The links to the files and description of inputs can be found in the following webpage references [,,,]. ENCODE Dnase sites, which mark regions of accessible chromatin across the ENCODE set of cell-lines [,], and methylation signal for the HeLa cell line were also retrieved from ENCODE [].
Gene Expression Atlas
Gene Expression Atlas (GEA) data [] were used to independently confirm the correlation between the number of mapping transcription factors and the breadth of expression. We see this as valuable that this strong trend could be detected using two very different expression technologies: one based on next generation sequencing (FANTOM5) and a second one based on microarray hybridization (GEA).
Assembling TF binding data to define promoter architecture
The 2011 meta data set included 2,750,490 ChIP-seq peaks for 148 transcription factors, derived from seventy-one cell-types with twenty-four additional experimental cell culture conditions []. Peaks were called and merged using UCSC clustering tools (encodeMergeReplicates, regClusterBedExpCfg and hgBedsToBedExps). Crucially, this procedure merges peaks for the same TF across replicates, cell lines, and from different labs into one. Peak scores varied between zero to 1000 (proportionately to the Tfbs prediction reliability). We used either all data, or only high-quality peaks with the score over half the maximum (i.e., 500) to avoid noisy data associated with multi-mapping next-generation sequencing tags. For the correlation between the breadth of expression and the number of transcription factor binding sites, we verified that results analogous to those for the January, 2011, data-freeze [] were obtained using a broader September, 2012, data-freeze []. The 2012 data-freeze consisted of 161 transcription factors and ninety-one human cell types with various treatment conditions []. The ENCODE preprocessing pipeline merges all overlapping binding sites for a TF in different cell lines into a single site. Therefore, the cumulative metric which we calculate corresponds to the total capacity of a promoter to bind TFs (i.e., promoter architecture), rather than to the actual number of sites in any particular cell- or tissue-type or across the sample space. This is crucial, as our approach relies on defining the promoter architecture, that is to say the binding landscape, rather than summing up Tfbs across the tissues which would be dangerously tautologous to measuring breadth of expression. However, we took great care not to fall into this trap. On top of the ENCODE’s merging procedure, we employ an additional safeguard by introducing a metric which only counts each TF once even if the promoter can bind this TF at alternative genomic locations (i.e., Tfbs_x_length_unique). The correlation between Tfbs_x_length and Tfbs_x_length_unique was almost perfect (r p = 0.9929) and these two measures supported identical biological conclusions in all analyses (ensuring that no single Tfbs could introduce a bias).
The Jaccard index for promoter divergence
The Jaccard index (JI) measured the overlap between two sets of genomic features []. The index was calculated as the ratio of the intersection over the union (JI = I/U). JI equaled one when the intersection equaled the union (that is when there was a perfect overlap). JI equaled zero when there was no overlap.
FANTOM5 gene expression tables
CAGE tags were mapped to RefSeq transcripts +/− 500 bps from their transcription start sites. The numbers of tags were normalized to tags per million (TPM). Finally, the TPM value of ten was chosen as the default cutoff for a gene to be “on” (unless stated otherwise). We downloaded RefSeq in BED format from the hg19 UCSC genome browser using the table browser tool. This dataset included 40,856 human transcripts, including messenger RNAs (NM-accession) and non-coding transcripts (NR-accession). Non-coding transcripts included structural RNAs and transcribed pseudogenes. The dataset was later processed with BEDtools [].
Gene Expression Atlas
Gene Expression Atlas [] was employed to confirm the correlation between the breadth of expression and the number of transcription factor binding sites in proximal promoters (Figure 5d). Affymetrix average difference (AD) higher that 200 classified a gene as “on” or expressed in a given tissue. Affymetrix ids were mapped to RefSeq ids using R annotation object hgu95aREFSEQ from the hgu95a.db package. ENCODE transcription factors were mapped to RefSeq just as in the FANTOM5 analysis.
TreeFam and the inference of gene duplications
TreeFam [] used evolutionary histories of individual genes to construct phylogenetic trees and to date gene duplications []. In our hands, TreeFam consistently delivered high quality phylogenetic trees which were congruent with the insights of molecular biologists [,].
TreeBeST was the tree-building engine behind TreeFam. TreeBeST merged a maximum likelihood tree from PHYML [] with neighbor-joining trees based on P-distance, Ka, and Ks. TreeBeST used smart heuristics intended to maximize the similarity between the gene tree and the species tree and to minimize the number of predicted gene duplications and losses.
TreeFam release eight was based on Ensembl version 54 and included 79 species, 1,539,621 genes, and 16,064 families. Phylogenetic timing was used to associate gene duplications with the emergence of different taxa. The Vertebrata and the Bilateria were consistently linked with the most numerous waves of duplications in animals []. The algorithm for speciation and duplication inference (SDI) reconciled gene trees with the species dendrogram. SDI also inferred duplications, speciation events, paralogy, and orthology.
The comparison of all paralog pairs versus the comparison of youngest pairs only
Combinatorics explains why there are more possible paralog comparisons than paralogs. In a set of n elements, the number of k-combinations was equal to the binomial coefficient ("n choose k"). K equals two for pairwise comparisons. For example, there were ten possible pairwise comparisons in a family of five paralogs. There were 45 legitimate pairwise comparisons in a family of ten paralogs. However, these comparisons might not be regarded as independent data points, and the results might have been biased by a few large gene families. The best alternative was to perform a comparison between each paralog with only its closest relative (i.e., only the most recent duplicate).
The dataset used in the preparation of this study was released to the public domain as an R package called the Duplicator []. R data-frames with data on duplications mapping to different taxa can be found in R helper environments for the package (env_duplicator_base and env_duplicator_vectors), which are located in the path duplicator/data/duplicator. Naming conventions indicate the species of origin. For instance, the R data-frames containing human data are called: dupEvent12_genes_LL_hs (individual genes) and dupEvent12_genes_LL_hs2 (paralog pairs).
Data-frames in the package follow the denormalized data model. For example, the R data-frame dupEvent12_genes_LL_hs2 has the following fields: family (TreeFam family ID), node (unique identifier for each gene duplication node), taxon (taxon of duplication), gene.x (ENSEMBL ENSG ID for paralog x), familySide.x, primary_acc.x (Entrez IDs for paralog x), gene.y (ENSEMBL ENSG ID for paralog y), familySide.y, primary_acc.y (Entrez IDs for paralog y). FamilySide was a flag with values of one or two defining on which side of the duplication node the gene was located; this flag was used to prevent multiple comparisons on the “same side” of the primary duplication node if later duplications occurred.
The analytical pipeline
Data retrieval and remodeling (the pre-BEDtools stage). This stage consisted of data retrieval from TreeFam and remodeling into R data-frames. Helper environments held these data-frames for later use, and stored intermediate results.
The detection of the overlap between promoters and transcription factor binding sites (the BEDtools stage). We checked if two sets of genomic features overlapped using BEDtools. We used mostly coverageBed and intersectBed. CoverageBed calculates the depth of coverage of features in the file A against the file B. Herein, coverageBed was used to calculate the depth of coverage by ENCODE transcription factor binding sites in promoters (coverageBed -a ENCODE_Tfbs -b RefSeq_promoters > result). IntersectBed writes the original entry in the file A for each overlap when used with the “-wa” option (intersectBed -a ENCODE_Tfbs -b RefSeq_promoters -wa > result). This identified the exact type of transcription factor binding sites in the overlap. An alternative analysis pipeline was constructed using R/BioC packages GenomicRanges and IRanges. However, the R/BioC pipeline proved prohibitively slow even on SNIC Supercomputers. The alternative pipeline was used only to verify the results of the main BEDtools-based pipeline.
Parsing, statistical tests, and figure generation (the post-BEDtools stage). Post-processing was performed in R and Bioconductor (version 2.11). Standard Bioconductor packages such as Biodist, gplot, ggplot, rtracklayer, TxDb.Hsapiens.UCSC.hg19.knownGene, and GOstats were used.
The post-BEDtoolsanalysis was divided into the four steps listed below.
Step (1): Parsing BEDtools output into an R list.
BEDtools output in the browser extensible data (BED) file format was read into R as a data-frame. The data-frame was then remodeled into an R list called resJaccard_ENCODE_Tfbs_substrate. Each element of the list was indexed with a RefSeq transcript id as the accession key. A vector of ENCODE transcription factor binding sites was contained within each element of the list. Transcription factor parsing and creation of the list were performed by the script called env_bed_jaccard_make_1.R. Intermediate results were stored in the R helper environment named env_promoter_bed_500_res_env.
Step (2): The Jaccard index calculation.
The Jaccard index was calculated using the script resJaccard_ ENCODE_Tfbs_substrate. Additionally, a randomized dataset was generated using sampling without replacement. The randomized dataset was used to calculate a control distribution of the Jaccard index.
Step (3): Additional calculations.
The following additional variables were calculated: relative frequencies of different transcription factor binding sites, intersection, union, and the Jaccard index depending on the taxon of duplication. These calculations were performed by scripts called env_jaccard_analyse_result_bed_sampled.R and env_jaccard_analyse_random_bed_sampled.R.
Step (4): Figure generation.
Manuscript figures were generated automatically using ggplot2 from intermediate results stored in R helper environments.
The support vector machine
The support vector machine is a commonly used machine learning technique for regression. We used the R package e1071, an R implementation of libsvm. Each row of the basic SVM training data-frame was a vector representing the number of Tfbs for each gene; although, we also considered SVMs trained using promoter GC- and CpG-contents (Table 5). The training dataset was scaled and centered. We used a radial kernel as a regression machine. After a grid search for optimal parameter values, the following SVM parameters were used: cost = 1, gamma = 0.01, and epsilon = 0.1. The continuous response variable was either the breadth of expression (BoE) or the preferential expression measure (PEM).
Access to FANTOM5 is provided at the FANTOM5 public website, including the UCSC genome browser mirror, and FANTOM5’s own CAGE-focused ZENBU genome browser.
Acknowledgments and funding
Lukasz Huminiecki was partially funded by Bioinformatics Infrastructure for Life Sciences (BILS). BILS was a distributed national research infrastructure supported by the Swedish Research Council (Vetenskapsrådet), providing bioinformatics support to life science researchers in Sweden. FANTOM5 was made possible by a Research Grant for RIKEN Omics Science Center from MEXT to Yoshihide Hayashizaki and a Grant of the Innovative Cell Biology by Innovative Technology (Cell Innovation Program) from the MEXT, Japan to YH. We would like to thank all members of the FANTOM5 consortium for contributing to generation of samples and analysis of the data-set and thank GeNAS for data production. The Swedish National Infrastructure for Computing (SNIC) provided computational resources for the Swedish research community. Our computations were performed on SNIC resources, UPPMAX and kappa.
- Makova KD, Li WH: Divergence in the spatial pattern of gene expression between human duplicate genes. Genome Res. 2003, 13: 1638-1645. 10.1101/gr.1133803.PubMedPubMed CentralView ArticleGoogle Scholar
- Huminiecki L, Wolfe KH: Divergence of spatial gene expression profiles following species-specific gene duplications in human and mouse. Genome Res. 2004, 14: 1870-1879. 10.1101/gr.2705204.PubMedPubMed CentralView ArticleGoogle Scholar
- Struhl K: Fundamentally different logic of gene regulation in eukaryotes and prokaryotes. Cell. 1999, 98: 1-4. 10.1016/S0092-8674(00)80599-1.PubMedView ArticleGoogle Scholar
- Ptashne MGA: Genes and signals. 2002, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New YorkGoogle Scholar
- Segal E, Raveh-Sadka T, Schroeder M, Unnerstall U, Gaul U: Predicting expression patterns from regulatory sequence in Drosophila segmentation. Nature. 2008, 451: 535-U531. 10.1038/nature06496.PubMedView ArticleGoogle Scholar
- Tirosh I, Weinberger A, Bezalel D, Kaganovich M, Barkai N: On the relation between promoter divergence and gene expression evolution. Mol Syst Biol. 2008, 4: 159-PubMedPubMed CentralView ArticleGoogle Scholar
- Ge Y, Porse BT: The functional consequences of intron retention: Alternative splicing coupled to NMD as a regulator of gene expression. Bioessays. 2013, 36: 236-243.PubMedView ArticleGoogle Scholar
- Stroynowska-Czerwinska A, Fiszer A, Krzyzosiak WJ: The panorama of miRNA-mediated mechanisms in mammalian cells. Cell Mol Life Sci. 2014, 71: 2253-2270. 10.1007/s00018-013-1551-6.PubMedPubMed CentralView ArticleGoogle Scholar
- Cheadle C, Fan J, Cho-Chung YS, Werner T, Ray J, Do L, Gorospe M, Becker KG: Control of gene expression during T cell activation: alternate regulation of mRNA transcription and mRNA stability. BMC Genomics. 2005, 6: 75-10.1186/1471-2164-6-75.PubMedPubMed CentralView ArticleGoogle Scholar
- Gilbert N, Boyle S, Fiegler H, Woodfine K, Carter NP, Bickmore WA: Chromatin architecture of the human genome: Gene-rich domains are enriched in open chromatin fibers. Cell. 2004, 118: 555-566. 10.1016/j.cell.2004.08.011.PubMedView ArticleGoogle Scholar
- Chen T, Dent SY: Chromatin modifiers and remodellers: regulators of cellular differentiation. Nat Rev Genet. 2014, 15: 93-106.PubMedPubMed CentralView ArticleGoogle Scholar
- Agalioti T, Chen G, Thanos D: Deciphering the transcriptional histone acetylation code for a human gene. Cell. 2002, 111: 381-392. 10.1016/S0092-8674(02)01077-2.PubMedView ArticleGoogle Scholar
- Gierman HJ, Indemans MH, Koster J, Goetze S, Seppen J, Geerts D, van Driel R, Versteeg R: Domain-wide regulation of gene expression in the human genome. Genome Res. 2007, 17: 1286-1295. 10.1101/gr.6276007.PubMedPubMed CentralView ArticleGoogle Scholar
- Raj A, Peskin CS, Tranchina D, Vargas DY, Tyagi S: Stochastic mRNA synthesis in mammalian cells. Plos Biology. 2006, 4: e309-10.1371/journal.pbio.0040309.PubMedPubMed CentralView ArticleGoogle Scholar
- Ebisuya M, Yamamoto T, Nakajima M, Nishida E: Ripples from neighbouring transcription. Nat Cell Biol. 2008, 10: 1106-1113. 10.1038/ncb1771.PubMedView ArticleGoogle Scholar
- Batada NN, Urrutia AO, Hurst LD: Chromatin remodelling is a major source of coexpression of linked genes in yeast. Trends in genetics : TIG. 2007, 23: 480-484. 10.1016/j.tig.2007.08.003.PubMedView ArticleGoogle Scholar
- Park J, Xu K, Park T, Yi SV: What are the determinants of gene expression levels and breadths in the human genome?. Hum Mol Genet. 2012, 21: 46-56. 10.1093/hmg/ddr436.PubMedPubMed CentralView ArticleGoogle Scholar
- Illingworth RS, Bird AP: CpG islands–'a rough guide'. FEBS Lett. 2009, 583: 1713-1720. 10.1016/j.febslet.2009.04.012.PubMedView ArticleGoogle Scholar
- Borneman AR, Gianoulis TA, Zhang ZD, Yu H, Rozowsky J, Seringhaus MR, Wang LY, Gerstein M, Snyder M: Divergence of transcription factor binding sites across related yeast species. Science. 2007, 317: 815-819. 10.1126/science.1140748.PubMedView ArticleGoogle Scholar
- Park C, Makova KD: Coding region structural heterogeneity and turnover of transcription start sites contribute to divergence in expression between duplicate genes. Genome Biol. 2009, 10: R10-10.1186/gb-2009-10-1-r10.PubMedPubMed CentralView ArticleGoogle Scholar
- Zhang Z, Gu J, Gu X: How much expression divergence after yeast gene duplication could be explained by regulatory motif evolution?. Trends in genetics : TIG. 2004, 20: 403-407. 10.1016/j.tig.2004.07.006.PubMedView ArticleGoogle Scholar
- Papp B, Pal C, Hurst LD: Evolution of cis-regulatory elements in duplicated genes of yeast. Trends in genetics : TIG. 2003, 19: 417-422. 10.1016/S0168-9525(03)00174-4.PubMedView ArticleGoogle Scholar
- Dunham I, Kundaje A, Aldred SF, Collins PJ, Davis CA, Doyle F, Epstein CB, Frietze S, Harrow J, Kaul R, Khatun J, Lajoie BR, Landt SG, Lee BK, Pauli F, Rosenbloom KR, Sabo P, Safi A, Sanyal A, Shoresh N, Simon JM, Song L, Trinklein ND, Altshuler RC, Birney E, Brown JB, Cheng C, Djebali S, Dong X, Ernst J, et al: An integrated encyclopedia of DNA elements in the human genome. Nature. 2012, 489: 57-74. 10.1038/nature11247.View ArticleGoogle Scholar
- Landt SG, Marinov GK, Kundaje A, Kheradpour P, Pauli F, Batzoglou S, Bernstein BE, Bickel P, Brown JB, Cayting P, Chen Y, DeSalvo G, Epstein C, Fisher-Aylor KI, Euskirchen G, Gerstein M, Gertz J, Hartemink AJ, Hoffman MM, Iyer VR, Jung YL, Karmakar S, Kellis M, Kharchenko PV, Li Q, Liu T, Liu XS, Ma L, Milosavljevic A, Myers RM, et al: ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 2012, 22: 1813-1831. 10.1101/gr.136184.111.PubMedPubMed CentralView ArticleGoogle Scholar
- ENCODE-TfbsV2. ᅟ, ᅟ, ᅟGoogle Scholar
- ENCODE-TfbsV3. ᅟ, ᅟ, ᅟGoogle Scholar
- Forrest AR, Kawaji H, Rehli M, Baillie JK, de Hoon MJ, Lassmann T, Itoh M, Summers KM, Suzuki H, Daub CO, Kawai J, Heutink P, Hide W, Freeman TC, Lenhard B, Bajic VB, Taylor MS, Makeev VJ, Sandelin A, Hume DA, Carninci P, Hayashizaki Y: A promoter-level mammalian expression atlas. Nature. 2014, 507: 462-470. 10.1038/nature13182.PubMedView ArticleGoogle Scholar
- Velculescu VE, Madden SL, Zhang L, Lash AE, Yu J, Rago C, Lal A, Wang CJ, Beaudry GA, Ciriello KM, Cook BP, Dufault MR, Ferguson AT, Gao Y, He TC, Hermeking H, Hiraldo SK, Hwang PM, Lopez MA, Luderer HF, Mathews B, Petroziello JM, Polyak K, Zawel L, Kinzler KW: Analysis of human transcriptomes. Nat Genet. 1999, 23: 387-388.PubMedView ArticleGoogle Scholar
- Wang J, Zhuang J, Iyer S, Lin X, Whitfield TW, Greven MC, Pierce BG, Dong X, Kundaje A, Cheng Y, Rando OJ, Birney E, Myers RM, Noble WS, Snyder M, Weng Z: Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res. 2012, 22: 1798-1812. 10.1101/gr.139105.112.PubMedPubMed CentralView ArticleGoogle Scholar
- Spivakov M, Akhtar J, Kheradpour P, Beal K, Girardot C, Koscielny G, Herrero J, Kellis M, Furlong EE, Birney E: Analysis of variation at transcription factor binding sites in Drosophila and humans. Genome Biol. 2012, 13: R49-10.1186/gb-2012-13-9-r49.PubMedPubMed CentralView ArticleGoogle Scholar
- ENCODE-InputsV2. ᅟ, ᅟ, ᅟGoogle Scholar
- ENCODE-InputsV3. ᅟ, ᅟ, ᅟGoogle Scholar
- Makova KD, Li WH: Divergence in the spatial pattern of gene expression between human duplicate genes. Genome Res. 2003, 13: 1638-1645. 10.1101/gr.1133803.PubMedPubMed CentralView ArticleGoogle Scholar
- Huminiecki L, Wolfe KH: Divergence of spatial gene expression profiles following species-specific gene duplications in human and mouse. Genome Res. 2004, 14: 1870-1879. 10.1101/gr.2705204.PubMedPubMed CentralView ArticleGoogle Scholar
- Jordan IK, Marino-Ramirez L, Koonin EV: Evolutionary significance of gene expression divergence. Gene. 2005, 345: 119-126. 10.1016/j.gene.2004.11.034.PubMedPubMed CentralView ArticleGoogle Scholar
- Pereira V, Waxman D, Eyre-Walker A: A problem with the correlation coefficient as a measure of gene expression divergence. Genetics. 2009, 183: 1597-1600. 10.1534/genetics.109.110247.PubMedPubMed CentralView ArticleGoogle Scholar
- Piasecka B, Robinson-Rechavi M, Bergmann S: Correcting for the bias due to expression specificity improves the estimation of constrained evolution of expression between mouse and human. Bioinformatics. 2012, 28: 1865-1872. 10.1093/bioinformatics/bts266.PubMedPubMed CentralView ArticleGoogle Scholar
- He X, Zhang J: Higher duplicability of less important genes in yeast genomes. Mol Biol Evol. 2006, 23: 144-151.PubMedView ArticleGoogle Scholar
- Woods S, Coghlan A, Rivers D, Warnecke T, Jeffries SJ, Kwon T, Rogers A, Hurst LD, Ahringer J: Duplication and Retention Biases of Essential and Non-Essential Genes Revealed by Systematic Knockdown Analyses. Plos Genetics. 2013, 9: e1003330-10.1371/journal.pgen.1003330.PubMedPubMed CentralView ArticleGoogle Scholar
- Gu Z, Nicolae D, Lu HH, Li WH: Rapid divergence in expression between duplicate genes inferred from microarray data. Trends in genetics : TIG. 2002, 18: 609-613. 10.1016/S0168-9525(02)02837-8.PubMedView ArticleGoogle Scholar
- Loh YH, Wu Q, Chew JL, Vega VB, Zhang W, Chen X, Bourque G, George J, Leong B, Liu J, Wong KY, Sung KW, Lee CW, Zhao XD, Chiu KP, Lipovich L, Kuznetsov VA, Robson P, Stanton LW, Wei CL, Ruan Y, Lim B, Ng HH: The Oct4 and Nanog transcription network regulates pluripotency in mouse embryonic stem cells. Nat Genet. 2006, 38: 431-440. 10.1038/ng1760.PubMedView ArticleGoogle Scholar
- Gause M, Schaaf CA, Dorsett D: Cohesin and CTCF: cooperating to control chromosome conformation?. Bioessays. 2008, 30: 715-718. 10.1002/bies.20787.PubMedView ArticleGoogle Scholar
- Wendt KS, Peters JM: How cohesin and CTCF cooperate in regulating gene expression. Chromosome Res. 2009, 17: 201-214. 10.1007/s10577-008-9017-7.PubMedView ArticleGoogle Scholar
- Factorbook. ᅟ, ᅟ, ᅟGoogle Scholar
- Su AI, Cooke MP, Ching KA, Hakak Y, Walker JR, Wiltshire T, Orth AP, Vega RG, Sapinoso LM, Moqrich A, Patapoutian A, Hampton GM, Schultz PG, Hogenesch JB: Large-scale analysis of the human and mouse transcriptomes. Proc Natl Acad Sci U S A. 2002, 99: 4465-4470. 10.1073/pnas.012025199.PubMedPubMed CentralView ArticleGoogle Scholar
- WP4 expression tables. ᅟ, ᅟ, ᅟGoogle Scholar
- Primary FANTOM5 CAGE data. ᅟ, ᅟ, ᅟGoogle Scholar
- Dnase sites slustered. ᅟ, ᅟ, ᅟGoogle Scholar
- Dnase-Inputs. ᅟ, ᅟ, ᅟGoogle Scholar
- Methylation in HeLa. ᅟ, ᅟ, ᅟGoogle Scholar
- Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, Cooke MP, Walker JR, Hogenesch JB: A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A. 2004, 101: 6062-6067. 10.1073/pnas.0400782101.PubMedPubMed CentralView ArticleGoogle Scholar
- Favorov A, Mularoni L, Cope LM, Medvedeva Y, Mironov AA, Makeev VJ, Wheelan SJ: Exploring massive, genome scale datasets with the GenometriCorr package. PLoS Comput Biol. 2012, 8: e1002529-10.1371/journal.pcbi.1002529.PubMedPubMed CentralView ArticleGoogle Scholar
- Quinlan AR, Hall IM: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010, 26: 841-842. 10.1093/bioinformatics/btq033.PubMedPubMed CentralView ArticleGoogle Scholar
- Li H, Coghlan A, Ruan J, Coin LJ, Heriche JK, Osmotherly L, Li R, Liu T, Zhang Z, Bolund L, Wong GK, Zheng W, Dehal P, Wang J, Durbin R: TreeFam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Res. 2006, 34: D572-D580. 10.1093/nar/gkj118.PubMedPubMed CentralView ArticleGoogle Scholar
- Huminiecki L, Goldovsky L, Freilich S, Moustakas A, Ouzounis C, Heldin CH: Emergence, development and diversification of the TGF-beta signalling pathway within the animal kingdom. BMC Evol Biol. 2009, 9: 28-10.1186/1471-2148-9-28.PubMedPubMed CentralView ArticleGoogle Scholar
- Huminiecki L, Heldin CH: 2R and remodeling of vertebrate signal transduction engine. BMC Biol. 2010, 8: 146-10.1186/1741-7007-8-146.PubMedPubMed CentralView ArticleGoogle Scholar
- Guindon S, Gascuel O: A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 2003, 52: 696-704. 10.1080/10635150390235520.PubMedView ArticleGoogle Scholar
- The Duplicator. ᅟ, ᅟ, ᅟGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.