Non-targeted transcription factors motifs are a systemic component of ChIP-seq datasets
© Worsley Hunt and Wasserman; licensee BioMed Central 2014
Received: 25 April 2014
Accepted: 29 July 2014
Published: 29 July 2014
Skip to main content
© Worsley Hunt and Wasserman; licensee BioMed Central 2014
Received: 25 April 2014
Accepted: 29 July 2014
Published: 29 July 2014
The global effort to annotate the non-coding portion of the human genome relies heavily on chromatin immunoprecipitation data generated with high-throughput DNA sequencing (ChIP-seq). ChIP-seq is generally successful in detailing the segments of the genome bound by the immunoprecipitated transcription factor (TF), however almost all datasets contain genomic regions devoid of the canonical motif for the TF. It remains to be determined if these regions are related to the immunoprecipitated TF or whether, despite the use of controls, there is a portion of peaks that can be attributed to other causes.
Analyses across hundreds of ChIP-seq datasets generated for sequence-specific DNA binding TFs reveal a small set of TF binding profiles for which predicted TF binding site motifs are repeatedly observed to be significantly enriched. Grouping related binding profiles, the set includes: CTCF-like, ETS-like, JUN-like, and THAP11 profiles. These frequently enriched profiles are termed ‘zingers’ to highlight their unanticipated enrichment in datasets for which they were not the targeted TF, and their potential impact on the interpretation and analysis of TF ChIP-seq data. Peaks with zinger motifs and lacking the ChIPped TF’s motif are observed to compose up to 45% of a ChIP-seq dataset. There is substantial overlap of zinger motif containing regions between diverse TF datasets, suggesting a mechanism that is not TF-specific for the recovery of these regions.
Based on the zinger regions proximity to cohesin-bound segments, a loading station model is proposed. Further study of zingers will advance understanding of gene regulation.
The mapping of the regulatory sequences in the human genome is proceeding rapidly. Large-scale chromatin immunoprecipitation coupled to high-throughput sequencing (ChIP-seq) experiments have been a central component of the mapping efforts, including both transcription factor (TF) target and histone target derivatives . These mapping efforts are providing key insights into the properties of regulatory sequences, the interactions between TFs, and the mechanisms contributing to selective patterns of gene transcription. With the compilation of large and diverse ChIP-seq data collections, an opportunity has emerged to study the common characteristics of TF-bound regions revealed by ChIP-seq.
The characteristics of ChIP-seq data are shaped by both biological and technical influences [2–5]. As with every high-throughput technology, the community learns progressively more about the nuances of the data as they accumulate. Much effort has focused on the development of peak finding methods, which allow for the quantitative determination of TF-bound regions within the sequences recovered in a ChIP-seq experiment. In general, most methods take into account a background rate of sequence recovery and use this background to evaluate the significance of an observed number of mapped reads in the foreground ChIP experiment . Most commonly background sequence data sources are generated from sheared input DNA or mock immunoprecipitation (mock-IP) using a non-specific antibody (for example, IgG). The comparison of the foreground against the background by peak finding software is often the basis for specifying the TF-bound regions, usually delineated with a start, stop, and local maximum read density position (that is, ‘peakMax’).
It is clear that the ChIP-seq procedure is working well for detecting regions bound by sequence-specific TFs. Analysis of ChIP-seq datasets reveals an enrichment of the expected TF binding site (TFBS) pattern close to the peakMax or, where no peakMax is determined, peak centre positions (hereafter also referred to as ‘peakMax’) [6,7]. Ab initio pattern discovery software applied to ChIP-seq data routinely recover the known TFBS pattern , and pattern enrichment methods confirm highly significant enrichment of the TFBS pattern of the ChIPped TF [9,10]. Additionally, a sufficient number of replicates have been performed to demonstrate general consistency between ChIP-seq datasets using the same cells and antibodies .
The properties of DNA in the nucleus have a strong influence on the results of diverse methods, including ChIP-seq and DNase I hypersensitivity mapping data . Both input DNA and diverse ChIPped DNA reveal a strong tendency for the recovery of sequences from promoter regions [4,11], indicating that the DNA shearing process favors regions of open or less compact DNA. These open regions have been demonstrated to be enriched for TF binding and other indicators of accessible DNA such as key histone modifications .
One of the open questions about ChIP-seq results is the not infrequent recovery of peaks under which the target motif of the ChIPped TF is absent. Such observations might be attributable to an inadequate understanding of the TF binding specificity, the potential indirect tethering of a TF to a region through protein-protein interactions, or non-specific antibody pull-down. Based on this background, we sought to understand the properties of ChIP-seq TF binding data, with an emphasis on the identification of mechanisms to account for the past observations of peaks lacking the motif of the targeted TF. Based on our research, we report a striking property of TFBS enrichment around the peakMax for CTCF-like, JUN-like, ETS-like, and THAP11 motifs across a broad set of TF ChIP-seq data. The broadly enriched TFBS classes, which we term ‘zingers’ for their startling enrichment, can account for a substantial portion of TFBS ChIP-seq data. The zinger regions are observed to recur across ChIP-seq data from multiple cell lines and for multiple TFs. These recurring regions tend to be proximal to structural features defined by cohesin and polycomb group proteins. A model to account for the observed properties of zingers is introduced and discussed.
A subset of TF ChIP-seq data has been reported to lack motifs for the ChIPped TF, suggesting that there may be additional proteins interacting in a sequence specific manner with these regions. Drawing together diverse TF-ChIP-seq data, we sought to determine if characterized TFs might account for a portion of the discrepancy. To measure the enrichment of TF motifs across the compiled TF ChIP-seq datasets we performed motif over-representation analyses, using the oPOSSUM 3.0 software . We tested 165 position weight matrices (PWMs) selectively curated from the JASPAR development database (see methods), on 285 human datasets (33 cell-lines) for 101 TFs (ENCODE and other resources; see Materials and methods). A parallel analysis of mouse data was performed for 81 datasets (12 cell-lines) encompassing 43 TFs (ENCODE and other resources; see Materials and methods). For each oPOSSUM analysis we provided a set of background sequences of similar length and nucleotide composition relative to the ChIP-seq dataset (all peaks were constrained to 401 bp length). As there were two or more ChIP-seq datasets for many TFs, generated from different cell lines or conditions, we averaged the oPOSSUM enrichment scores across all datasets for a given ChIPped TF. The details of the statistical measures and assessed thresholds are presented in the methods. Briefly, two oPOSSUM enrichment scores were used to evaluate the datasets: a Fisher-log score (to assess enrichment of motifs across many ChIP-seq peaks) and a Kolmogorov-Smirnov (KS) centrality score (to assess enrichment of motifs in proximity to the peakMax position).
To assess if zinger enrichment is independent of the ChIPped TFs’ motifs (that is, not over-lapping the expected motif), we performed a second enrichment analysis on human ChIP-seq sequences in which the ChIPped TF motifs were masked (thus restricting the analysis to the subset of ChIP-seq datasets for which a TF binding profile is available). We again consider the two metrics of Fisher-log enrichment score and KS centrality score. The CTCF, ETS-like, and JUN-like zinger motifs remained enriched (Additional file 2: Figure S2A).
Short patterns, such as those found by PWMs, can occur by chance in the genome. To confirm the findings of zinger-specific enrichment, we shuffled the zinger PFMs and determined the likelihood of achieving the frequency of enriched datasets observed for the original profile (see Materials and methods). In all cases, the comparison against the frequency of enriched datasets obtained with the shuffled matrices confirmed that the true zinger motifs’ enrichment was extremely unlikely to occur by chance (P values are: 2.5e-44 for CTCF, 2.8e-09 for GABPA, and 3.7e-08 for JUN).
We sought to determine if ab initio pattern discovery could recover either novel profiles or known TFBS profiles in pooled data, a process requiring a greater signal-to-noise ratio than the more noise-tolerant oPOSSUM motif enrichment testing above. Across all of the ChIP-seq data, we masked the motif of the ChIPped TF and repeat-masked the sequences (see Materials and methods), then drew five sets of 5,000 sequences from the ChIP-seq pool and subjected each set to pattern discovery analysis using the MEME system . From the five replicate pools, MEME returned profiles for wide and high information content patterns. In all cases MEME detected a pattern consistent with the CTCF binding profile in the top six results (Figure 1C, top logo) and a profile unknown to MEME Suite’s TOMTOM pattern similarity scoring system  (Figure 1C, bottom logo). A report from Ngondo-Mbongo et al.  identified that THAP11 binds to a motif that matches the unknown profile, so we will hereafter refer to the MEME derived profile as the THAP11 profile. We reviewed oPOSSUM results for the enrichment of the THAP11 motif, and found that it is consistent with the zingers for the Fisher-log score enrichment frequency, but the motif is not frequently observed to be centrally positioned based on the oPOSSUM KS-score (although it is proximal to the peakMax by the heuristic motif enrichment method presented below in this report). Given the strength of evidence, we elected to classify THAP11 motif as an additional zinger.
Using the oPOSSUM enrichment analysis procedure, we sought to determine if the zingers showed enrichment in other genomic data collections. ChIP-seq data are recognized to be highly enriched with open chromatin regions, and in particular ChIP-seq data for CTCF, one of the zinger TFs, are known to strongly overlap with DNase I hypersensitive sites [16,17]. We therefore analyzed ENCODE DNaseI-seq and Faire-seq data to assess the enrichment of the zinger motifs. Each region (average 150 bp) was extended to 401 bp for enrichment analysis using the oPOSSUM 3.0 software. oPOSSUM enrichment results revealed the zinger profiles to be the most frequently enriched within DNaseI-seq and Faire-seq datasets, showing enrichment near the region centre in 50% to 100% of the DNaseI-seq datasets, and 20% to 92% of Faire-seq datasets (Additional file 3: Figure S3). We further assessed the ratio of zinger motifs in DNase and Faire regions compared to flanking regions, providing an indication of the portion of each dataset that could be attributed to zingers: mean values of 47% for DNaseI-seq and 13% for Faire-seq were obtained (see Additional file 4: Text S1).
We have observed enrichment of zingers in other open chromatin associated data such as ChIP-seq data for helicase-related proteins or histone modifiers (Additional file 4: Text S1 and Additional file 5: Figure S4), and ChIP-seq control data (Additional file 4: Text S1 and Additional file 6: Figure S5). Thus zinger motifs are observed in multiple classes of genomic datasets.
For ease of reference, we will hereafter use ‘zinger motifs’ to refer to the collection of CTCF, JUN-like, ETS-like, and THAP11 motifs within the enrichment zones and ‘zinger motif peaks’ to refer to those peaks within a dataset that have a zinger motif but not the ChIPped TF’s motif. Motif predictions outside the enrichment zones will be referred to as ‘distal-zinger’ motifs.
As anticipated, peaks with the ChIPped TF’s motif proximal to the peakMax comprised the majority of most datasets (up to 99% in the best case). After accounting for background ChIPped TF motif rates, the mean observed portion was 55% (the median was 59% with a median absolute deviation (MAD) of 27 pp). There are, however, extreme cases in which the ChIPped TF’s canonical binding motif is present in less than 10% of the peaks (Additional file 4: Text S1 and Additional file 8: Figure S7).
After accounting for background, and excluding two outliers, up to 45% of a ChIP-seq dataset are zinger motif peaks with a mean of 12% (median of 9% with a MAD of 3 pp) (Additional file 9: Figure S8A). The zinger motif peaks account for up to 69% of the set of peaks unexplained by the ChIPped TF’s motif, with a mean of 27% (median of 27% with MAD of 14 pp), in datasets with at least 1% zinger motif peak content (Additional file 9: Figure S8B); the zinger motif peak enrichment is visually depicted in a heat map format (Additional file 9: Figure S8C). For clarity, the portion of zinger motif peaks are anti-correlated with the portion of ChIPped TF motif peaks (Additional file 9: Figure S8D).
As zinger motifs are present in peaks without the ChIPped TF’s motif we wanted to determine if there were any characteristics specific to or in common among this set of peaks. We found that neither the presence nor proportion of zinger motif peaks within a ChIP-seq dataset is dependent on cell type, as seen in Additional file 10: Figure S9A for the five most abundant cell lines. Neither, is the proportion of zinger motif-containing peaks consistent across multiple datasets for the same TF (Additional file 10: Figure S9B).
Next we asked if the zinger motifs have a strong tendency to co-occur in the same zinger motif peaks. We found that at most 11% of datasets show a positive association with a significant P value (Fisher exact P values <0.001 and log odds ratios >0) for any pairwise co-occurrence of two different zingers within a single peak (the most frequent pair of zingers being GABPA and THAP11). A few datasets (17%) show a negative association for zinger motif co-occurrence with a significant P value (Additional file 11: Figure S10A). Thus, the zinger motifs are not inter-dependent. We next evaluated the pairwise tendency for zinger motif peak enrichment within the same ChIP-seq datasets, finding unremarkable correlation values (correlation coefficients -0.0233 to 0.3803) (Additional file 11: Figure S10B).
Lastly we determined whether zinger motif peaks were consistently located near a feature in the genome. We evaluated the proximity of zinger-associated regions to genomic features such as transcription start sites (TSS), CpG islands, conserved regions, and repeat sequence regions. Comparing the set of zinger motif peaks to peaks with the ChIPped TF’s motif, we did not detect consistent enrichment tendencies that distinguished between the two sets of regions (Additional file 4: Text S1).
As zinger motifs are an unexpected presence across datasets we assessed the quality of the peaks they occur in, asking if the zinger motifs tended to be in the lower scoring peaks of the dataset. We compared the peak calling scores of peaks containing the ChIPped TF’s motif against peaks with a zingers’ motif. The peak scores for the zinger containing peaks are significantly poorer than for those peaks with the ChIPped TF’s canonical motif (Wilcoxon one-tailed test P values <5.0e-05).
A comparison of the peak scores for the ChIP-seq peaks that overlapped the set of zinger motif peaks versus the set of distal-zinger peaks revealed a significant difference between the two groups (Wilcoxon one-tailed test significance threshold P value <0.001). The zinger motif peaks associated with stronger scoring ChIP-seq peaks than did the distal-zinger peaks for the majority of datasets (that is, 81%, 67%, and 79% of CTCF, JUN, and GABPA ChIP-seq datasets, respectively).
As zinger motif peaks are enriched in numerous datasets for which the zinger is not the targeted TF, we asked whether the same zinger regions were occurring repeatedly across multiple datasets, that is, are the same zinger regions being ChIPped by many TFs. We pooled the zinger motif peaks, which by definition lacked the motif of the ChIPped TF, from across datasets (33 cell lines; 823,574 peaks), requiring that the zinger motif have a strong motif score of 85 or greater to reduce false positives. We assigned peaks whose peakMax were within 50 bp of each other into neighbourhoods (see Materials and methods), and then assessed the recurrence of each neighborhood, that is, the number of unique TFs whose datasets contributed a zinger motif peak to the neighbourhood.
We similarly generated neighborhoods from those regions with neither the ChIPped TF nor zinger motifs (unidentified motif neighborhoods - 536,546), and from the regions found to have a high scoring motif (score >85) for the ChIPped TF and no zinger motif (ChIPped TF neighborhoods - 408,677) (see Materials and methods). The zinger neighborhoods were found to be ChIPped by significantly more unique TFs than are the other two sets of neighborhoods (Wilcoxon one-tailed test P value = 0).
The recurrence of the zinger motif peaks across datasets prompted us to consider the motif content of HOT regions. HOT (high occupancy of transcription-related proteins) regions, as defined by Yip et al. , are ChIP-seq regions that within a single cell line (GM12878, HeLa, H1-hESC, HepG2, or K562) demonstrate binding co-occurrence among chromatin-related factors, general TFs, and sequence-specific TFs. Yip et al. noted that a substantial portion of a cell-line’s HOT regions are motif-less for the ChIPped factor, and associate with strong DNaseI signals. The HOT regions are present in two or more cell lines in 25% of cases according to Yip et al., while zinger neighborhoods were noted above to be 77% of cases. oPOSSUM over-representation analysis on the combined set of HOT regions found the zinger motifs to be 13 out of the 20 most enriched patterns, consistent with what was observed above for the DNaseI-seq/Faire-seq open chromatin datasets (Additional file 13: Table S1).
Recurring open chromatin enrichment across datasets suggested that structural properties of chromatin might contribute to zinger motif recovery across ChIP experiments . Cohesin is a protein noted for both its role in gene regulation and DNA structure [20,21]. It is a multi-subunit complex, which is believed to form a ring like structure around DNA, and has been well documented in its role of sister chromatid interaction during the mitotic metaphase. Cohesin has also been implicated in promoting interaction between enhancers and core promoters of active genes in embryonic stem cells  and in chromosomal looping . Chromosomal looping may be a structural element that is conducive to DNA shearing under the stress of sonication. Additionally, cohesin or associated proteins may function as a ‘loading station’ by bringing together proteins bound to remote regulatory elements and promoter regions that will in turn regulate transcription within the looped region .
We evaluated the proximity of the zinger neighborhoods to cohesin-interacting regions. Zinger neighborhoods are enriched for proximity (that is, within 500 bp) to cohesin regions (via RAD21 and SMC3 ChIP-seq) compared to the ChIPped TF neighborhoods or unidentified motif neighborhoods (Fisher exact one-tailed test P value of 0 for both comparisons; 77% of the zinger neighborhoods observed for multiple TFs are proximal, while 46% of the unidentified motif and 13% of the ChIPped TF neighborhoods are so positioned). The neighborhoods for unidentified motif peaks were also significantly more proximal to cohesin than neighborhoods from ChIPped TF peaks (Fisher exact one-tailed test P value of 0). As some of the neighborhoods contain CTCF zinger attributed regions, and cohesin is known to interact with CTCF [24,25], we removed neighborhoods within 500 bp of a CTCF ChIP-seq region and repeated the analysis. Regardless of the depletion of CTCF associated neighborhoods, the zinger neighborhoods remained significantly closer to cohesin (Fisher exact one-tailed test P value of 0 for all comparisons).
Another system noted to impact chromatin structure are the polycomb group proteins (including polycomb repressive complex 1 (PRC1) and polycomb repressive complex 1 (PRC2) forms), which are implicated in the remodeling of chromatin. In drosophila, PRC1 has been noted to interact with cohesin to co-regulate active genes . We used ChIP-seq data for the constituent proteins CBX and EZH2 proteins to identify regions bound by the PRC1 and PRC2 complexes, respectively. We found that the zinger neighborhoods were significantly closer to CBX peaks and EZH2 peaks than are the neighborhoods derived from either ChIPped TF motif peaks, or from unidentified motif peaks (Fisher exact one-tailed test P value of 0). We observed that the PRC1 and PRC2 peaks proximal to the zinger neighborhoods, tend to be those that are also within 500 bp of cohesin (Fisher exact one-tailed test P value <7.6e-160 for PRC1, and P = 0 for PRC2). The unidentified motif neighborhoods are, in turn, significantly closer to PRC regions than the neighborhoods derived from peaks with the motif for the ChIP-seq experiment’s targeted TF.
Thus, the zinger neighborhoods, and to a lesser degree the unidentified motif neighborhoods, are associated with cohesin and polycomb repressive complex regions. This suggests that these diverse regions, which were initially identified as not containing the motif of the ChIPped TF, and yet in many cases enriched for an alternative motif (zingers), may be part of a structure involving cohesin. Such a structure could influence the tendency for these regions to be detected recurrently across diverse ChIP-seq data.
ChIP-seq experiments are increasingly used to investigate how sequence-specific DNA binding TFs regulate gene expression. In this report, we introduce ‘zingers’: four classes of TFBSs that display significant binding site enrichment, unexpectedly proximal to the peakMax, across ChIP-seq binding experiments for other TFs. Within individual TF ChIP-seq experiments, up to 45% of peaks are observed that lack the canonical TF binding motif and contain a zinger motif, with a mean of 12% (median 9%). While biased to the lower scoring peaks in other TF ChIP-seq data, the same zinger-associated regions tend to be high scoring peaks within datasets ChIPped for the zinger TF; indicating these regions are likely bound by the zinger TF. The zinger motif peaks derive from 257,631 regions (neighborhoods) in the genome, 36% of which are observed recurrently across datasets for diverse TFs, in sharp contrast to neighborhoods containing only the ChIPped TF’s motif, which recur relatively infrequently. Some regions lacking both the ChIPped TF’s motif and a zinger motif, are also recurrently observed. Both zinger motif and unidentified motif neighborhoods are positioned proximal to structural regions defined by the presence of cohesin and polycomb group complexes. Accounting for the contribution of zinger-associated regions to global studies of regulatory sequences will be a consideration for future analysis of ChIP-seq data.
From a broader mechanistic perspective, a loading station mechanism is consistent with the ‘hop-skip-jump’ theory for how TFs efficiently search the nucleus to arrive at their TFBSs . The proposed loading station model is supported in recent literature. Faure et al.  propose a role for cohesin in stabilizing large protein-DNA complexes. While this manuscript was under review, Taipale et al.  published a study using the LoVo cell line suggesting that cohesin participates in holding chromatin open during cell division to facilitate TFs relocating back to those regions once division is complete.
The zinger content of every ChIP-seq dataset should be evaluated, consistent with a growing effort to critically evaluate such data [12,29,30]. For instance, the STAT1 (GM12878) ChIP-seq dataset exceeds 30% of peaks with zinger motifs proximal to the peakMax, while STAT1 motifs occur only at the background frequency. We propose a general approach for the study of zinger content. For each ChIP-seq dataset, the peak regions should be scanned for the presence of the ChIPped TF motif in proximity to the peakMax. The peaks lacking a ChIPped TF motif should be compared to the recurring zinger neighborhoods (Additional file 12: Dataset S1). The portion of the dataset overlapping the neighborhoods gives insight into the overall specificity of the experiment.
We have identified zinger motifs that are frequently enriched across a portion of TF ChIP-seq data, including CTCF-like, ETS-like, and JUN-like motif families, and THAP11. As high-throughput ChIP-seq data informs genome annotation, research into gene regulation may be impacted by zinger motif derived annotations. Moving forward it will be important to determine the prevalence of zinger-like motifs in ChIP-seq data in diverse organisms, probe the structural properties of the zinger regions, and develop computational approaches to systematically identify recurring zinger regions in large-scale genome annotation. Ultimately, understanding the biophysical processes that result in the zinger motif enrichment in ChIP-seq data may provide broader insight into the mechanisms of transcription regulation.
For our analyses, we used ENCODE ChIP-seq datasets (human and mouse), ENCODE DNaseI-seq and Faire-seq data, and human ChIP-seq controls  downloaded from the UCSC ENCODE database . We also incorporated non-ENCODE ChIP datasets downloaded from GEO: (1) GSE11431 - 13 mouse ESC datasets ; (2) GSE25532 - mouse NFYA data in ES cells ; (3) GSE17917 and GSE18292 - human KLF4, POU5F1, cMYC, NANOG, and SOX2 data ; and (4) GSE22078 - human and mouse CEBPA and HNF4A . Where only the mapped data were available, we used FindPeaks 4.0  to call peaks using the following parameter options: dist_type 1 200 -subpeaks 0.6 -trim 0.2 -duplicatefilter. The ENCODE broadPeak datasets frequently occurred in replicate; to avoid duplication, only the replicate with the most peaks of a pair was used for analyses.
Where coordinates were provided as NCBI36/hg18 or NCBI36/mm8, they were first converted to GRCh37/hg19 or NCBI37/mm9, using a locally installed version of the UCSC lift-over tool . We then used the Ensembl API to retrieve sequences from GRCh37/hg19 and NCBI37/mm9 assemblies.
The ENCODE ChIP-seq data are in one of two formats, narrowPeak and broadPeak. Both formats contain two columns pertaining to statistical significance of the peaks (also known as peak scores): one is a P value, the other a q value (bonferroni corrected). We used the q value field when it was assigned, and otherwise used the P value field.
As peaks are reported in a multitude of lengths, in the range of 1 bp to greater than 5,000 bp, we trimmed or extended all peaks to a constant length centered at the peak maximum for narrowPeak format datasets, or at the peak centre for broadPeak format and DNase-seq/Faire-seq datasets. For enrichment visualization and determining heuristic boundaries of enrichment we used 1,001 bp sequences, oPOSSUM TFBS enrichment analysis input was 401 bp sequences, and ab initio motif detection input was 201 bp sequences.
Position frequency matrices (PFMs) were obtained from the JASPAR  development 4.0_alpha database of transcription factor models (prior to 2013, April). Where the JASPAR PFM did not agree with the consensus in the literature we performed an ab initio analysis on the top 500 peaks (selected by peak score) of two or more ChIP-seq datasets for the given TF, using a locally installed version of the MEME software . MEME results were then checked against the literature and for enrichment in a different ChIP-seq dataset for the given TF. MEME position specific probability matrices (PSPM) were converted to PFMs by transposing the PSPM and multiplying each letter (A, G, C, T) frequency in the matrix by the number of sites found by MEME. The PFMs were subsequently converted to position weight matrices (PWMs), using the TFBS Perl Module , only PWMs based on PFMs with information content (IC) greater than 8 bits were retained. The PFMs used in this study are provided in Additional file 14: Dataset S2.
For those analyses using datasets of shuffled matrices, the datasets were generated by random permutation of all columns of the originating PFMs, excluding the lower information content columns on the edges (2 columns on each side for all cases, except for the wider CTCF PWM for which 3 columns on each side were held constant).
Motif over-representation analyses were performed with a locally installed version of oPOSSUM 3.0 . We used the sequence-based analysis option with default settings, except for specifying the use of the JASPAR development PFM matrices (Additional file 14: Dataset S2). We trimmed or extended all peaks to 401 bp. Backgrounds for the over-representation analyses came from the mappable portion of the genome, and were chosen to match the sequence length and mononucleotide GC composition distribution of each dataset.
The oPOSSUM Fisher-log enrichment score is derived from a one-tailed Fisher exact probability test, based on the hypergeometric distribution which compares the number of sequences that contain a motif for the TF of interest in the target and background datasets. The negative natural logarithm of the Fisher test probabilities is the reported Fisher-log score. Thus a Fisher-log score of 6.91 or higher is equivalent to a P value of 0.001 or lower. Fisher-log enrichment scores of ‘infinite’ value were set to either 500 or to 100 past the maximum non-infinite Fisher-log score.
The oPOSSUM KS centrality score is the negative logarithm of the probabilities from a Kolmogorov-Smirnov test. Thus a KS score of 6.91 or higher is equivalent to a P value of 0.001 or lower. The Kolmogorov-Smirnov tests whether a TF’s motifs are positionally enriched at the center of the target sequences relative to the motifs in the background set of sequences. KS ‘infinite’ enrichment scores were set to 100.
To calculate the number of datasets enriched for a motif we first obtained the average Fisher-log score and KS log score for datasets ChIPped for the same TF. Once we had a set of scores for each TF, we used a binary count of 1 or 0 to indicate whether both of the oPOSSUM enrichment scores passed a threshold based on the standard deviation (SD) of the scores or not (two SD for Fisher-log scores and one SD for KS log scores). This yielded the number of datasets with enrichment around the sequence midpoint for each of the 165 TFs. We then applied a further correction to compensate for the bias created by multiple datasets for families of TFs that recognize the same motif (for example, JUN, JUND, JUNB, AP1, FOS, FOSL1, FOSL2, and BATF PWMs all recognize a TGA(g/c)TCA consensus). The number of motif-family members, minus one, was subtracted from the count of datasets for each of the member TFs, for example, if JUNB were enriched in 20 TF datasets, and 9 of those datasets were ChIPped for a TF that recognizes the JUN-motif family consensus, then a count of eight would be subtracted from 20. The 165 TFs were then ranked according to this final number of associated datasets.
To assess the probability of a PWM’s predictions being enriched within as many datasets as observed with the zinger PWMs, we shuffled the PFMs of the zingers and fit a distribution to the results. We generated 100 shuffled matrices as described above. We performed oPOSSUM enrichment analyses with the shuffled PWMs, on the same human datasets as used to generate Figure 1. The oPOSSUM results were evaluated as outlined above. However, we applied the enrichment score thresholds for each dataset as was set for the original PWMs. We then counted the number of datasets within which each shuffled profile was enriched, and fit a zero-adjusted logarithmic distribution (ZALG) to the counts. The distribution was selected using the fitDist() function in the R statistical package GAMLSS 4.1-5 , and the parameters describing the distribution were obtained with gamlss family ZALG and the gamlss() function. We tested for goodness-of-fit of the distribution to the data by generating datasets from the random generation function, rZALG, and assaying the similarity of the generated distributions to our data using a chi-squared test. The fitted distribution function was then used to determine the probability of the shuffled PWMs obtaining a result as extreme as the original PWM. The probability was calculated with the density function for the zero-adjusted logarithmic distribution (dZALG).
Motif prediction was performed with C-code adapted from the TFBS Perl Module , reporting relative motif scores. Motifs predicted by a PFM are not permitted to overlap by more than one-fifth the PFM length (this setting is intended to equate to the low information content flanks of a PWM), for example, a 7 bp motif could only overlap a neighboring motif by 1 bp.
For post-oPOSSUM analyses, we predicted the presence of zinger motifs using one PWM per zinger TF motif family as proxy, to prevent redundancies. CTCF-like motifs were predicted with the CTCF PWM, ETS-like motifs with the GABPA PWM, JUN-like motifs with the JUN PWM, and THAP11 motifs with a THAP11 PWM.
MEME  analyses were run using the following options: -dna -nmotifs 10 -minw 6 -maxw 15 -maxsize 2000000 -mod zoops -revcomp. TOMTOM  analyses were run with default values, aside from increasing the E-value threshold to 20, from the web server.
Data processing and statistical analyses were done with a combination of in-house Unix and R scripts (R version 2.14.1) . Throughout the manuscript we report the combination of median and the median absolute deviation (MAD), a measure of dispersion around the median. For a normal distribution the median and MAD are the same values as the mean and SD.
To visualize peakMax proximal enrichment of TF motifs within ChIP-seq datasets, the top scoring predicted motif in each region for the given TF PWM, was plotted relative to its signed distance from the peakMax (using the R basic statistical package ). The dense horizontal ranges of motif scores spanning all positions relative to the peakMax, such as seen in the Figure 2 plots, are observed for the combination of most PWMs and ChIP-seq datasets, and are likely a mixture of both false and true TFBS predictions. Those motif matches that are distal to the peakMax are anticipated to be less reliable, as the observed frequency is consistent with background rates of motif prediction. If we take enrichment proximal to the peakMax as a measure of confidence for the predictions we can determine a distance threshold and motif score threshold (see next section) at a point where motif frequency proximal to the peakMax is greater than the flanking distal motif frequency. Using this threshold, we can select a sub-population of peaks that are less likely to have arisen by chance.
We assessed the enrichment of motif distance to the peakMax and motif score, using a heuristic method for topological motif enrichment , which we outline in brief here. To determine whether a motif was proximal to the peakMax, we used heuristic distance boundaries derived from the density of the top scoring motif for each 1,001 bp region. We identified the location, relative to the 501st bp, at which the density of motifs exceeds that of the distal region (approximately 175 to 500 bp distant from the peakMax). This change in density is observed in the TFBS-landscape plots of Figure 2, where there is a constant density of motif scores in the distal regions and an increase in the density of motif scores within approximately 100 bp of the peakMax. The heuristic distance boundaries were set at the transition point. A similar procedure was applied to determine a threshold for the motif score, where the motif score threshold was set at the point where the motif enrichment proximal to the peakMax was at least 20% higher than the flanking enrichment. The region defined by the distance boundaries and the motif score threshold, was termed the ‘enrichment zone’. The enrichment zone was subsequently used to specify peakMax enriched proximal motifs. On average, an enrichment boundary was ±90 bp from the peakMax, and the motif relative score threshold was 82.
The heuristic analysis of motif enrichment across datasets reports that on average a CTCF zinger motif is enriched above a motif score threshold of 79, while for JUN the average was 86, for GABPA it was 83, and for THAP11 it was 84. CTCF and THAP11 in particular consistently have enrichment above a motif score threshold of 85 that is strongly distinct from the flanking regions of similar score range, as seen in Figure 2A and D. The regions that flank the peakMax proximal enrichment in Figure 2 are representative of the background expectation of a PWM’s motif prediction. Thus, to reduce the presence of false positive predictions in subsets of peaks we analyzed, we selected, where noted in the main text, peaks with a motif scoring above the motif score threshold of 85. The use of a single threshold permits the processing of data as a single unit. A motif score of 85 is also the default threshold score in the oPOSSUM software.
To estimate the proportion of regions in a given dataset in which motifs may result from background motif prediction, we compared the count of regions with motifs in the enrichment zone relative to the count of regions with motifs at least 50 bp outside the enrichment zone. The distal ‘zone’ from which counts were determined, was set to be the same length of sequence as the enrichment zone, that is, if the enrichment zone was 200 bp wide, then the distal zone was also 200 bp wide (100 bp from 5′ and 100 bp from 3′ of the region center). To estimate the proportion of regions in the enrichment zone with false positives, we divided the number of regions with motifs in the distal zone by the number of regions with motifs in the enrichment zone. See Additional file 4: Text S1 for the estimated overall background expectation of ChIPped TF and zinger motif prediction.
Calculating the background corrected estimates of ChIPped TF and zinger motif proportions within a dataset was done by subtracting the distal zone count from the enrichment zone count for the ChIPped TF or each zinger. For the ChIPped TF, the corrected count was divided by the size of the dataset. For the four zingers, the four corrected counts were first summed, and then divided by the size of the dataset.
Heatmaps were created with the heatmap.2() function from the R statistical package: gplots, with the distance measure as ‘manhattan’ and the ‘ward’ agglomeration method for clustering.
The heatmap of zinger motif peak log2 fold enrichment was generated using the log2 fold enrichment of zinger motif peaks with motif score 85 or greater, relative to distal-zinger peaks of similar score range. Where the fold enrichment was below 1.5 we assigned a minimum value, represented as a grey colour in the heatmap, to facilitate visualization.
A heatmap of zinger motif inter-dependency within datasets was generated using the set of zinger motif peaks with motif scores equal to or greater than 85, and a 2×2 confusion matrix for each pair of zinger motifs. A Fisher exact P value <0.001 was taken to indicate significance and the sign of the log odds ratio to indicate whether a positive or negative association existed. The values used to generate the heatmap were 1-pvalue for positive associations, -1*(1-pvalue) for the negative associations, and 0 for the non-significant P values.
The pairwise correlation of zinger motif peaks for the different zingers, across datasets, was assessed using the log2 fold enrichment values generated for the above heatmap. The correlations were evaluated with both Pearson correlation and Spearman’s rank order correlation (R basic statistical package: cor() function).
We obtained controls from a range of cell types and ENCODE consortium groups, and processed the mapped reads with FindPeaks. We used the peak height to rank the control peaks, and then selected the top 70,000 peaks. The number of peaks was chosen to match the average size of the ChIP-seq datasets. The peaks were then scored with the zinger PWMs and the enrichment of the motifs with respect to the peakMax was evaluated.
We compared the genomic feature proximity of zinger motif peaks, with those peaks containing the ChIPped TF’s motif and lacking zinger motifs. We measured the distance between the peakMax and the middle of the feature, which in the case of transcription start sites (TSSs) was simply the starting coordinate of the transcript. We used only those datasets for which we had at least 200 zinger motif peaks. The number of peaks that were within 500 bp, 1 kb or 5 kb of the TSS, or within 500 bp of CpG islands, conserved regions or repeat-masked regions were compared between the zinger motif peaks and the ChIPped TF peaks using a Fisher exact test. For the results of a zinger to be considered striking we required that at least 60% of the datasets with zinger motifs show statistical significance in one direction, that is, either 60% of datasets tend to be proximal to a feature, or 60% of datasets tend to be distal to a feature.
We assessed the proximity of the zinger motif peaks with a high scoring zinger motif (score >85) to ChIP-seq peaks ChIPped by the zinger’s TF to determine whether the zinger motif peaks found in datasets for which the zinger is not the targeted TF, are potential bona fide binding regions for the zinger TF. In all cases we required that the zinger motif peaks and zinger TF’s ChIP-seq data be from the same cell line. To call a zinger motif peak in agreement with the zinger TF’s ChIP-seq data we required that the peakMax of the zinger motif peak be within 100 bp of a peakMax in the zinger TF’s dataset. This 100 bp distance reflects the average range of enrichment for a TF’s motif relative to the peakMax. The assessment of the distal-zinger peaks, that is, those peaks with motifs not proximal to the peakMax, relative to the zinger TF’s ChIP-seq dataset was performed in the same manner.
To determine the degree of recurrence for a zinger motif peak region across multiple datasets we pooled all zinger motif peaks that had a high scoring (score >85) zinger motif from all datasets. We then calculated the inter-zinger distances between each zinger motif peak and its nearest neighbour in the 3′ direction on the plus strand. Consecutive peaks that were within 50 bp of their nearest neighbor were merged into a ‘zinger neighborhood’. The distance of 50 bp was chosen as a stringent measure of proximity between zinger motif peaks. For each neighborhood, we counted the number of unique TFs that ChIPped the zinger motif peaks and the number of unique cell lines. We provide the coordinates for the zinger neighborhoods in Additional file 12: Dataset S1.
We generated neighborhoods from the remaining two groups of peaks in a similar manner: those with the ChIPped TF motifs and lacking zinger motifs (‘ChIPped TF neighborhoods’), and those without either motif (‘unidentified motif neighborhoods’). For the ChIPped TF neighborhoods we required that there be a high scoring motif (score >85) for the ChIPped TF. Neighborhood widths were <150 bp on average. As stated in the main text, zinger motif peaks may be bona fide binding regions for the zinger TF. Thus, after generating the neighborhood sets, we removed from the ChIPped TF neighborhoods those regions that were within 300 bp (measured centre to centre of the zinger neighborhoods to ensure that comparisons were made between distinct neighborhood sets. We also removed from the ChIPped TF neighborhoods those regions that overlapped the unidentified motif neighborhoods in the same manner.
To assess whether a neighborhood is proximal to a region occupied by cohesin or the polycomb repressive complex (PRC) 1 or 2, we generated three datasets by combining the ENCODE ChIP-seq data for the cohesin proteins, RAD21 and SMC3, into a one dataset; combining the ENCODE ChIP-seq data for CBX to form a dataset for PRC1 occupancy, and lastly combining EZH2 ChIP-seq data into a dataset for PRC2 occupancy. We then assessed how many zinger neighborhoods were situated within 500 bp of one of the three protein complexes, measuring from the center of a neighborhood to the ChIP-seq peakMax, and compared this to the two other neighborhoods.
Chromatin immunoprecipitation and high-throughput sequencing
Peak local maximum
Position frequency matrix
Position weight matrix
Transcription factor binding site
We thank Dr. Sorana Morrissy, Dr. Anthony Fejes, Dr. Francois Parcy, and Dr. Maja Tarailo-Graovac, Dr. Anthony Mathelier, Chih-yu Chen and the other members of the Wasserman lab for suggestions and feedback, Miroslav Hatas for systems support, and Dora Pak for management support. We are indebted to the researchers around the globe who generated the ChIP-seq data. The work was supported by funding from the National Institutes of Health (USA) grant 1R01GM084875, the Canadian Institutes for Health Research, the National Science and Engineering Research Council (NSERC), and GenomeBC and GenomeCanada (ABC4DE Project). RWH was supported by fellowships from the Canadian Institutes of Health Research (CIHR) and the Michael Smith Foundation for Health Research (MSFHR).
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.