### Datasets

For our analyses, we used ENCODE ChIP-seq datasets (human and mouse), ENCODE DNaseI-seq and Faire-seq data, and human ChIP-seq controls [1] downloaded from the UCSC ENCODE database [31]. We also incorporated non-ENCODE ChIP datasets downloaded from GEO: (1) GSE11431 - 13 mouse ESC datasets [32]; (2) GSE25532 - mouse NFYA data in ES cells [33]; (3) GSE17917 and GSE18292 - human KLF4, POU5F1, cMYC, NANOG, and SOX2 data [34]; and (4) GSE22078 - human and mouse CEBPA and HNF4A [35]. Where only the mapped data were available, we used FindPeaks 4.0 [36] to call peaks using the following parameter options: dist_type 1 200 -subpeaks 0.6 -trim 0.2 -duplicatefilter. The ENCODE broadPeak datasets frequently occurred in replicate; to avoid duplication, only the replicate with the most peaks of a pair was used for analyses.

Where coordinates were provided as NCBI36/hg18 or NCBI36/mm8, they were first converted to GRCh37/hg19 or NCBI37/mm9, using a locally installed version of the UCSC lift-over tool [37]. We then used the Ensembl API to retrieve sequences from GRCh37/hg19 and NCBI37/mm9 assemblies.

The ENCODE ChIP-seq data are in one of two formats, narrowPeak and broadPeak. Both formats contain two columns pertaining to statistical significance of the peaks (also known as peak scores): one is a *P* value, the other a q value (bonferroni corrected). We used the q value field when it was assigned, and otherwise used the *P* value field.

As peaks are reported in a multitude of lengths, in the range of 1 bp to greater than 5,000 bp, we trimmed or extended all peaks to a constant length centered at the peak maximum for narrowPeak format datasets, or at the peak centre for broadPeak format and DNase-seq/Faire-seq datasets. For enrichment visualization and determining heuristic boundaries of enrichment we used 1,001 bp sequences, oPOSSUM TFBS enrichment analysis input was 401 bp sequences, and *ab initio* motif detection input was 201 bp sequences.

Position frequency matrices (PFMs) were obtained from the JASPAR [38] development 4.0_alpha database of transcription factor models (prior to 2013, April). Where the JASPAR PFM did not agree with the consensus in the literature we performed an *ab initio* analysis on the top 500 peaks (selected by peak score) of two or more ChIP-seq datasets for the given TF, using a locally installed version of the MEME software [8]. MEME results were then checked against the literature and for enrichment in a different ChIP-seq dataset for the given TF. MEME position specific probability matrices (PSPM) were converted to PFMs by transposing the PSPM and multiplying each letter (A, G, C, T) frequency in the matrix by the number of sites found by MEME. The PFMs were subsequently converted to position weight matrices (PWMs), using the TFBS Perl Module [39], only PWMs based on PFMs with information content (IC) greater than 8 bits were retained. The PFMs used in this study are provided in Additional file 14: Dataset S2.

For those analyses using datasets of shuffled matrices, the datasets were generated by random permutation of all columns of the originating PFMs, excluding the lower information content columns on the edges (2 columns on each side for all cases, except for the wider CTCF PWM for which 3 columns on each side were held constant).

### Motif over-representation analysis

Motif over-representation analyses were performed with a locally installed version of oPOSSUM 3.0 [9]. We used the sequence-based analysis option with default settings, except for specifying the use of the JASPAR development PFM matrices (Additional file 14: Dataset S2). We trimmed or extended all peaks to 401 bp. Backgrounds for the over-representation analyses came from the mappable portion of the genome, and were chosen to match the sequence length and mononucleotide GC composition distribution of each dataset.

The oPOSSUM Fisher-log enrichment score is derived from a one-tailed Fisher exact probability test, based on the hypergeometric distribution which compares the number of sequences that contain a motif for the TF of interest in the target and background datasets. The negative natural logarithm of the Fisher test probabilities is the reported Fisher-log score. Thus a Fisher-log score of 6.91 or higher is equivalent to a *P* value of 0.001 or lower. Fisher-log enrichment scores of ‘infinite’ value were set to either 500 or to 100 past the maximum non-infinite Fisher-log score.

The oPOSSUM KS centrality score is the negative logarithm of the probabilities from a Kolmogorov-Smirnov test. Thus a KS score of 6.91 or higher is equivalent to a *P* value of 0.001 or lower. The Kolmogorov-Smirnov tests whether a TF’s motifs are positionally enriched at the center of the target sequences relative to the motifs in the background set of sequences. KS ‘infinite’ enrichment scores were set to 100.

To calculate the number of datasets enriched for a motif we first obtained the average Fisher-log score and KS log score for datasets ChIPped for the same TF. Once we had a set of scores for each TF, we used a binary count of 1 or 0 to indicate whether both of the oPOSSUM enrichment scores passed a threshold based on the standard deviation (SD) of the scores or not (two SD for Fisher-log scores and one SD for KS log scores). This yielded the number of datasets with enrichment around the sequence midpoint for each of the 165 TFs. We then applied a further correction to compensate for the bias created by multiple datasets for families of TFs that recognize the same motif (for example, JUN, JUND, JUNB, AP1, FOS, FOSL1, FOSL2, and BATF PWMs all recognize a TGA(g/c)TCA consensus). The number of motif-family members, minus one, was subtracted from the count of datasets for each of the member TFs, for example, if JUNB were enriched in 20 TF datasets, and 9 of those datasets were ChIPped for a TF that recognizes the JUN-motif family consensus, then a count of eight would be subtracted from 20. The 165 TFs were then ranked according to this final number of associated datasets.

### Motif over-representation analysis with shuffled matrices

To assess the probability of a PWM’s predictions being enriched within as many datasets as observed with the zinger PWMs, we shuffled the PFMs of the zingers and fit a distribution to the results. We generated 100 shuffled matrices as described above. We performed oPOSSUM enrichment analyses with the shuffled PWMs, on the same human datasets as used to generate Figure 1. The oPOSSUM results were evaluated as outlined above. However, we applied the enrichment score thresholds for each dataset as was set for the original PWMs. We then counted the number of datasets within which each shuffled profile was enriched, and fit a zero-adjusted logarithmic distribution (ZALG) to the counts. The distribution was selected using the fitDist() function in the R statistical package GAMLSS 4.1-5 [40], and the parameters describing the distribution were obtained with gamlss family ZALG and the gamlss() function. We tested for goodness-of-fit of the distribution to the data by generating datasets from the random generation function, rZALG, and assaying the similarity of the generated distributions to our data using a chi-squared test. The fitted distribution function was then used to determine the probability of the shuffled PWMs obtaining a result as extreme as the original PWM. The probability was calculated with the density function for the zero-adjusted logarithmic distribution (dZALG).

### Motif prediction

Motif prediction was performed with C-code adapted from the TFBS Perl Module [39], reporting relative motif scores. Motifs predicted by a PFM are not permitted to overlap by more than one-fifth the PFM length (this setting is intended to equate to the low information content flanks of a PWM), for example, a 7 bp motif could only overlap a neighboring motif by 1 bp.

For post-oPOSSUM analyses, we predicted the presence of zinger motifs using one PWM per zinger TF motif family as proxy, to prevent redundancies. CTCF-like motifs were predicted with the CTCF PWM, ETS-like motifs with the GABPA PWM, JUN-like motifs with the JUN PWM, and THAP11 motifs with a THAP11 PWM.

### MEME suite tools

MEME [8] analyses were run using the following options: -dna -nmotifs 10 -minw 6 -maxw 15 -maxsize 2000000 -mod zoops -revcomp. TOMTOM [14] analyses were run with default values, aside from increasing the E-value threshold to 20, from the web server.

### Repeat-masking

Masking of repeat elements was performed using a local installation of RepeatMasker (RMBlast) [41] and RepBase [42], using default settings.

### Data processing and statistical analyses

Data processing and statistical analyses were done with a combination of in-house Unix and R scripts (R version 2.14.1) [40]. Throughout the manuscript we report the combination of median and the median absolute deviation (MAD), a measure of dispersion around the median. For a normal distribution the median and MAD are the same values as the mean and SD.

### TFBS-landscape visualization plots

To visualize peakMax proximal enrichment of TF motifs within ChIP-seq datasets, the top scoring predicted motif in each region for the given TF PWM, was plotted relative to its signed distance from the peakMax (using the R basic statistical package [40]). The dense horizontal ranges of motif scores spanning all positions relative to the peakMax, such as seen in the Figure 2 plots, are observed for the combination of most PWMs and ChIP-seq datasets, and are likely a mixture of both false and true TFBS predictions. Those motif matches that are distal to the peakMax are anticipated to be less reliable, as the observed frequency is consistent with background rates of motif prediction. If we take enrichment proximal to the peakMax as a measure of confidence for the predictions we can determine a distance threshold and motif score threshold (see next section) at a point where motif frequency proximal to the peakMax is greater than the flanking distal motif frequency. Using this threshold, we can select a sub-population of peaks that are less likely to have arisen by chance.

### Heuristic boundaries of enrichment

We assessed the enrichment of motif distance to the peakMax and motif score, using a heuristic method for topological motif enrichment [18], which we outline in brief here. To determine whether a motif was proximal to the peakMax, we used heuristic distance boundaries derived from the density of the top scoring motif for each 1,001 bp region. We identified the location, relative to the 501st bp, at which the density of motifs exceeds that of the distal region (approximately 175 to 500 bp distant from the peakMax). This change in density is observed in the TFBS-landscape plots of Figure 2, where there is a constant density of motif scores in the distal regions and an increase in the density of motif scores within approximately 100 bp of the peakMax. The heuristic distance boundaries were set at the transition point. A similar procedure was applied to determine a threshold for the motif score, where the motif score threshold was set at the point where the motif enrichment proximal to the peakMax was at least 20% higher than the flanking enrichment. The region defined by the distance boundaries and the motif score threshold, was termed the ‘enrichment zone’. The enrichment zone was subsequently used to specify peakMax enriched proximal motifs. On average, an enrichment boundary was ±90 bp from the peakMax, and the motif relative score threshold was 82.

The heuristic analysis of motif enrichment across datasets reports that on average a CTCF zinger motif is enriched above a motif score threshold of 79, while for JUN the average was 86, for GABPA it was 83, and for THAP11 it was 84. CTCF and THAP11 in particular consistently have enrichment above a motif score threshold of 85 that is strongly distinct from the flanking regions of similar score range, as seen in Figure 2A and D. The regions that flank the peakMax proximal enrichment in Figure 2 are representative of the background expectation of a PWM’s motif prediction. Thus, to reduce the presence of false positive predictions in subsets of peaks we analyzed, we selected, where noted in the main text, peaks with a motif scoring above the motif score threshold of 85. The use of a single threshold permits the processing of data as a single unit. A motif score of 85 is also the default threshold score in the oPOSSUM software.

### Background expectation of motif predictions

To estimate the proportion of regions in a given dataset in which motifs may result from background motif prediction, we compared the count of regions with motifs in the enrichment zone relative to the count of regions with motifs at least 50 bp outside the enrichment zone. The distal ‘zone’ from which counts were determined, was set to be the same length of sequence as the enrichment zone, that is, if the enrichment zone was 200 bp wide, then the distal zone was also 200 bp wide (100 bp from 5′ and 100 bp from 3′ of the region center). To estimate the proportion of regions in the enrichment zone with false positives, we divided the number of regions with motifs in the distal zone by the number of regions with motifs in the enrichment zone. See Additional file 4: Text S1 for the estimated overall background expectation of ChIPped TF and zinger motif prediction.

Calculating the background corrected estimates of ChIPped TF and zinger motif proportions within a dataset was done by subtracting the distal zone count from the enrichment zone count for the ChIPped TF or each zinger. For the ChIPped TF, the corrected count was divided by the size of the dataset. For the four zingers, the four corrected counts were first summed, and then divided by the size of the dataset.

### Heatmaps and correlation between zinger motifs

Heatmaps were created with the heatmap.2() function from the R statistical package: gplots, with the distance measure as ‘manhattan’ and the ‘ward’ agglomeration method for clustering.

The heatmap of zinger motif peak log_{2} fold enrichment was generated using the log_{2} fold enrichment of zinger motif peaks with motif score 85 or greater, relative to distal-zinger peaks of similar score range. Where the fold enrichment was below 1.5 we assigned a minimum value, represented as a grey colour in the heatmap, to facilitate visualization.

A heatmap of zinger motif inter-dependency within datasets was generated using the set of zinger motif peaks with motif scores equal to or greater than 85, and a 2×2 confusion matrix for each pair of zinger motifs. A Fisher exact *P* value <0.001 was taken to indicate significance and the sign of the log odds ratio to indicate whether a positive or negative association existed. The values used to generate the heatmap were 1-pvalue for positive associations, -1*(1-pvalue) for the negative associations, and 0 for the non-significant *P* values.

The pairwise correlation of zinger motif peaks for the different zingers, across datasets, was assessed using the log_{2} fold enrichment values generated for the above heatmap. The correlations were evaluated with both Pearson correlation and Spearman’s rank order correlation (R basic statistical package: cor() function).

### ChIP-seq controls

We obtained controls from a range of cell types and ENCODE consortium groups, and processed the mapped reads with FindPeaks. We used the peak height to rank the control peaks, and then selected the top 70,000 peaks. The number of peaks was chosen to match the average size of the ChIP-seq datasets. The peaks were then scored with the zinger PWMs and the enrichment of the motifs with respect to the peakMax was evaluated.

### Evaluating proximity of zinger motif peaks to genomic features

We compared the genomic feature proximity of zinger motif peaks, with those peaks containing the ChIPped TF’s motif and lacking zinger motifs. We measured the distance between the peakMax and the middle of the feature, which in the case of transcription start sites (TSSs) was simply the starting coordinate of the transcript. We used only those datasets for which we had at least 200 zinger motif peaks. The number of peaks that were within 500 bp, 1 kb or 5 kb of the TSS, or within 500 bp of CpG islands, conserved regions or repeat-masked regions were compared between the zinger motif peaks and the ChIPped TF peaks using a Fisher exact test. For the results of a zinger to be considered striking we required that at least 60% of the datasets with zinger motifs show statistical significance in one direction, that is, either 60% of datasets tend to be proximal to a feature, or 60% of datasets tend to be distal to a feature.

### Comparing zinger regions from non-zinger ChIP-seq datasets to peaks ChIPped by the zinger TF

We assessed the proximity of the zinger motif peaks with a high scoring zinger motif (score >85) to ChIP-seq peaks ChIPped by the zinger’s TF to determine whether the zinger motif peaks found in datasets for which the zinger is not the targeted TF, are potential *bona fide* binding regions for the zinger TF. In all cases we required that the zinger motif peaks and zinger TF’s ChIP-seq data be from the same cell line. To call a zinger motif peak in agreement with the zinger TF’s ChIP-seq data we required that the peakMax of the zinger motif peak be within 100 bp of a peakMax in the zinger TF’s dataset. This 100 bp distance reflects the average range of enrichment for a TF’s motif relative to the peakMax. The assessment of the distal-zinger peaks, that is, those peaks with motifs not proximal to the peakMax, relative to the zinger TF’s ChIP-seq dataset was performed in the same manner.

### Generation of ChIP-seq peak neighborhoods

To determine the degree of recurrence for a zinger motif peak region across multiple datasets we pooled all zinger motif peaks that had a high scoring (score >85) zinger motif from all datasets. We then calculated the inter-zinger distances between each zinger motif peak and its nearest neighbour in the 3′ direction on the plus strand. Consecutive peaks that were within 50 bp of their nearest neighbor were merged into a ‘zinger neighborhood’. The distance of 50 bp was chosen as a stringent measure of proximity between zinger motif peaks. For each neighborhood, we counted the number of unique TFs that ChIPped the zinger motif peaks and the number of unique cell lines. We provide the coordinates for the zinger neighborhoods in Additional file 12: Dataset S1.

We generated neighborhoods from the remaining two groups of peaks in a similar manner: those with the ChIPped TF motifs and lacking zinger motifs (‘ChIPped TF neighborhoods’), and those without either motif (‘unidentified motif neighborhoods’). For the ChIPped TF neighborhoods we required that there be a high scoring motif (score >85) for the ChIPped TF. Neighborhood widths were <150 bp on average. As stated in the main text, zinger motif peaks may be *bona fide* binding regions for the zinger TF. Thus, after generating the neighborhood sets, we removed from the ChIPped TF neighborhoods those regions that were within 300 bp (measured centre to centre of the zinger neighborhoods to ensure that comparisons were made between distinct neighborhood sets. We also removed from the ChIPped TF neighborhoods those regions that overlapped the unidentified motif neighborhoods in the same manner.

### Neighborhood proximity to cohesin and polycomb repressive complex

To assess whether a neighborhood is proximal to a region occupied by cohesin or the polycomb repressive complex (PRC) 1 or 2, we generated three datasets by combining the ENCODE ChIP-seq data for the cohesin proteins, RAD21 and SMC3, into a one dataset; combining the ENCODE ChIP-seq data for CBX to form a dataset for PRC1 occupancy, and lastly combining EZH2 ChIP-seq data into a dataset for PRC2 occupancy. We then assessed how many zinger neighborhoods were situated within 500 bp of one of the three protein complexes, measuring from the center of a neighborhood to the ChIP-seq peakMax, and compared this to the two other neighborhoods.