Cancer-specific CTCF binding facilitates oncogenic transcriptional dysregulation

Background The three-dimensional genome organization is critical for gene regulation and can malfunction in diseases like cancer. As a key regulator of genome organization, CCCTC-binding factor (CTCF) has been characterized as a DNA-binding protein with important functions in maintaining the topological structure of chromatin and inducing DNA looping. Among the prolific binding sites in the genome, several events with altered CTCF occupancy have been reported as associated with effects in physiology or disease. However, there is no hitherto a comprehensive survey of genome-wide CTCF binding patterns across different human cancers. Results To dissect functions of CTCF binding, we systematically analyze over 700 CTCF ChIP-seq profiles across human tissues and cancers and identify cancer-specific CTCF binding patterns in six cancer types. We show that cancer-specific lost and gained CTCF binding events are associated with altered chromatin interactions in patient samples, but not always with DNA methylation changes or sequence mutations. While lost bindings primarily occur near gene promoters, most gained CTCF binding events are induced by oncogenic transcription factors and exhibit enhancer activities. We validate these findings in T-cell acute lymphoblastic leukemia and show that oncogenic NOTCH1 induces specific CTCF binding and they cooperatively activate expression of target genes, indicating transcriptional condensation phenomena. Conclusions Cancer-specific CTCF binding events are not always associated with DNA methylation changes or mutations, but can be induced by other transcription factors to regulate oncogenic gene expression. Our results substantiate CTCF binding alteration as a functional epigenomic signature of cancer.


Background
The eukaryotic genome is known to fold into a hierarchical three-dimensional (3D) structure organized by numerous chromatin and transcription factor (TF) proteins [1]. High-throughput technologies such as Hi-C has helped delineate components of 3D genome organization, including topological associating domains (TADs) [2][3][4] and DNA loops [5]. Studies have shown that various protein factors have roles in chromatin folding that is required for proper gene expression [3,[6][7][8][9]. One such factor is CCCTC-binding factor (CTCF), a DNA-binding protein that induces chromatin looping and binds at TAD boundaries [10]. CTCF is integral to cell survival as total knockouts in mice are lethal early in embryogenesis and heterozygous knockouts are significantly predisposed to cancer [10][11][12]. Our previous studies using T-cell acute lymphoblastic leukemia (T-ALL) models have shown that cell-type conserved constitutive CTCF binding sites frequently occur at chromatin domain boundaries and facilitate interactions between TF-bound distal enhancers and their target genes [13]. We demonstrated that TAD boundary intensity associates with CTCF levels, which might also serve to isolate superenhancers [14]. While CTCF binding at TAD boundaries is usually conserved across diverse cell types and throughout development [15], we and others have shown that CTCF binding within TADs can also exhibit tissue specificity [14,[16][17][18].
The functional importance of CTCF is highlighted in disruptions to CTCF binding coupled with alterations in gene expression, which have been widely observed [19][20][21][22]. Deletions of insulator CTCF binding sites can cause aberrant chromatin interactions and differential expression of genes within TADs in developmental disorders and cancers [19,20,[23][24][25]. Many cases of CTCF disruption have been associated with changes in DNA methylation such as in isocitrate dehydrogenase (IDH) mutant gliomas [21], succinate dehydrogenase (SDH)-deficient gastrointestinal stromal tumors (GIST) [22] and immunoglobulin or T-cell receptors [18].
Additionally, the CTCF gene itself or its binding sequence can be mutated and has been 5 comprehensively study the genomic repertoire of CTCF binding, we collected 771 high-quality CTCF ChIP-seq datasets from the public domain. These datasets cover over 200 human cell types, including normal tissues and multiple cancer types (Fig. S1a, Table S1). We collectively identified 688,429 distinct CTCF binding sites by merging shared peaks called from each dataset (Fig. S1b-c). To study the binding occupancy pattern across cell types, we assigned an occupancy score to each CTCF site by tallying the ChIP-seq datasets exhibiting peaks within the site (Fig. S1a). We obtained a broad spectrum of CTCF binding site distribution from celltype specific to cross-cell-type conserved (constitutive) (Fig. 1b) and focused on the 285,467 high-confidence CTCF binding sites with occupancy score ≥ 3. We identified 22,097 constitutive CTCF binding sites, which were defined as binding present in at least 80% of all 771 datasets determined by an empirical model (Fig. 1b, S1d).
To identify cancer-specific CTCF binding patterns, we surveyed six cancer types: T-cell acute lymphoblastic leukemia (T-ALL), acute myeloid leukemia (AML), breast cancer (BRCA), colorectal cancer (CRC), lung cancer (LUAD) and prostate cancer (PRAD). These cancers have CTCF ChIP-seq data available in both cell lines and corresponding normal tissues (Table S2).
We compared both CTCF occupancy frequencies and normalized CTCF binding levels in samples from each cancer type versus all other datasets (Fig. 1c, S1e-i) as well as their corresponding normal tissue (Fig. S1j) to account for variations due to tissue type specificity.
We categorized a CTCF binding event as lost in a cancer if it had a lower occupancy score and lower binding level in the cancer cell lines compared to the corresponding normal tissue and to all other samples. A site was characterized as gained in a cancer if it had a higher occupancy score and higher binding level in the cancer cell lines compared to the corresponding normal tissue as well as to all other samples. Using this approach, we identified lost and gained CTCF binding sites in each of the six cancer types. The CTCF binding patterns at 102 lost and 72 gained sites identified in T-ALL are shown in Fig. 1d-f, comparing normal CD4 + T cell with two 6 T-ALL cell lines, Jurkat and CUTLL1 [33,34,[36][37][38][39] . Different cancer types share few commonly lost or gained sites, indicating cancer-type specificity of the identified CTCF binding patterns ( Fig. 1g). Across cancer types, however, lost sites are enriched at gene promoter regions (<2kb from any transcription start site, TSS), while gained sites are primarily located at distal and noncoding regions (Fig. 1h). This suggests that different cancers may employ similar mechanisms leading to CTCF binding loss or gain.
We further explored these lost and gained CTCF binding events identified from cancer cell lines in patient samples, to confirm that these unique patterns are not cell line-specific phenomena. In two T-ALL patient samples, 78 of the 102 lost sites (T-ALL lost ) were also depleted in at least one sample, and 33 of the 72 gained sites (T-ALL gained ) are present in at least one sample (Fig. S1k, Table S3). As CTCF binding positively correlates with chromatin accessibility (Fig. S2a, Table   S4), we systematically surveyed ATAC-seq data in BRCA, COAD, LUAD and PRAD patient samples from The Cancer Genome Atlas (TCGA) [40], and consistently observed that lost or gained CTCF sites identified from cell lines exhibit lower or higher chromatin accessibility, respectively, compared to the entire TCGA cohort (Fig. 1i), indicating that unique losses or gains in CTCF binding exist extensively in cancer patients. Specific CTCF binding pattern may also be indicative of clinical outcomes, as patients with elevated chromatin accessibility at gained CTCF binding sites have lower overall survival rates (Fig. S2b,c).

Unique CTCF binding sites link to patterns of chromatin interaction
As CTCF is known to induce DNA looping and is enriched at TAD boundaries [1], we then interrogated the relationship between altered CTCF occupancy and chromatin conformation in cancer. We performed in situ Hi-C in Jurkat, healthy donor CD4 + T cells, and patient T-ALL cells [5,41,42]. Differential analysis of our Hi-C data reveals that T-ALL lost sites have decreased and T-ALL gained sites have increased contact frequencies with their flanking regions (Fig. 2a,b,   7 S3a) compared to constitutive CTCF sites as controls. These findings were corroborated in our two T-ALL patient samples ( Fig. 2c-f, S3b,c) and in other malignancies (Fig. 2g,h, S3d, Table   S4). Together, these results suggest that cancer-specific alterations to CTCF binding highly associate with changes in local chromatin contacts relative to their normal physiological state.
In addition to regulating chromatin conformation, CTCF occupancy has been suggested to act as a boundary against spreading of histone modifications between loops and TADs [2,5].
Therefore, we assessed whether cancer-specific CTCF binding is implicated in histone modification patterns. Using publicly available ChIP-seq data, we examined the activating histone marks H3K4me1, H3K27ac, and the repressive mark H3K27me3, and found that cancer-specific gained CTCF binding associates with increased levels of enhancer marks H3K4me1 and H3K27ac, while lost CTCF sites do not significantly correlate with any of the histone modifications tested (Fig. S4). Analysis of T-ALL patient samples yielded similar results (Fig. S5).

Cancer-specific CTCF binding gain and loss associate with differential gene expression within chromatin domains
To study the effects of CTCF binding on gene expression, we used an unbiased approach in which we compiled a comprehensive list of all possible combinations of CTCF site and gene pairs that are located within the same chromosome [4,12,13]. We measured both CTCF binding level and gene expression level for each CTCF-gene pair and calculated their correlation across 54 cell types for which both CTCF ChIP-seq and RNA-seq data are available (Table S4). This correlation score represents the association between CTCF binding and gene expression (Fig.   3a). Upon dividing the CTCF-gene pairs into two groups based on whether the paired loci are in the same or different divergently oriented constitutive CTCF-bound chromatin domains [13] (Fig.   3b), we found that pairs in the same domain are more likely to be highly correlated, regardless 8 of genomic distance (Fig. 3c). This indicates that any effect of CTCF binding in regulating gene expression tends to be confined within constitutive CTCF-bound chromatin domains [13]. These domains are highly consistent with TADs identified from Hi-C maps [43] (Fig. S6a,b).
We then tested whether those CTCF binding sites specifically lost or gained in cancer associate with expression of genes within the same chromatin domains. If a CTCF binding site is located in a gene promoter region, we directly used that gene as the candidate target. Otherwise, we assigned all genes located within the same domain as the CTCF site as intra-domain candidate target genes. Using this rubric, we found that cancer-specific lost CTCF binding events tend to have higher correlation with their promoter target genes, which are also more likely to be downregulated in cancer (Fig. 3d). Genes that strongly associate with cancer-specific gained CTCF binding sites, on the other hand, tend to be up-regulated in cancer (Fig. 3e). In general, without considering CTCF-gene pair correlations, genes surrounding lost CTCF binding sites within the same chromatin domain tend to be down regulated while genes surrounding gained CTCF binding sites are more likely to be activated. This relationship holds in both cancer cell lines ( Fig. S6c,d) and the two T-ALL patient samples (Fig. 3f,g). Furthermore, we found that genes highly correlated with BRCA gained intra-domain gained CTCF are enriched for the essential genes identified using CRISPR-screen data from the breast cancer cell line T47D [44], suggesting that gained CTCF is involved in cancer functions (Fig. S6e).

Differential DNA methylation or sequence mutation do not explain all cancer-specific CTCF binding patterns
We next sought to identify determinants of cancer-specific CTCF binding. To date, the primary identified effectors of variation in CTCF binding at specific loci in cancers include altered DNA methylation at a CTCF motif [20,21,29,45] or mutations affecting the CTCF binding motif [29].
Prior studies have shown that CTCF binding negatively correlates with DNA methylation [15,46].
We collected DNA methylation profiles from T-ALL, BRCA, CRC, LUAD and PRAD and their corresponding normal tissue, and calculated the differential levels of DNA methylation over a 300bp region centered at each identified lost or gained CTCF binding site. We noticed that some but not all lost CTCF binding sites associate with increased DNA methylation (Fig. 4a), while most gained CTCF sites do not associate with DNA methylation reduction (Fig. 4b). The result was confirmed in the T-ALL patient samples (Fig. S7). We therefore concluded that DNA methylation can only explain a portion of cancer-specific CTCF binding changes.
Stable CTCF binding is highly specific to the presence of its DNA binding motif and can be compromised by mutations affecting the consensus motif sequence [20,29]. We performed whole genome sequencing (WGS) in T-ALL samples and found very few genetic alterations at gained or lost binding loci (Fig. S8). Using WGS data for AML, BRCA, CRC, LUAD and PRAD patient samples from the International Cancer Genome Consortium (ICGC) [47], we consistently observed that CTCF loss or gain does not associate with mutations altering the consensus binding sequence (Fig. S9). These data show that neither sequence mutations nor DNA methylation changes can fully explain cancer-specific CTCF binding events.
Cancer-specific gained CTCF co-activates target genes with oncogenic transcription factors CTCF has been shown to co-bind DNA with other factors to establish DNA loops and control gene expression [19,48]; thus, we looked for TFs potentially involved in cancer-specific CTCF gain events that associate with dynamic chromatin interaction and increased gene expression.
Using our in situ Hi-C [4,5,14,49,50] data in T-ALL compared with normal CD4 + T cells, we identified genomic regions within the same chromatin domain that interact more frequently with T-ALL gained CTCF sites (Fig. S10a,b) and used BART [51] to identify putative TFs that preferentially bind in these regions. We found that NOTCH1 [13], a major oncogenic driver in T-10 ALL, is one of the top TFs with binding sites enriched in these regions (Fig. 5a). Potential oncogenic TFs in CRC were also identified using the same approach (Fig. S10c, Table S5).
Indeed, compared to normal CD4 + T cells, gained CTCF sites in T-ALL interact more frequently with "dynamic" NOTCH1 binding sites, previously defined as those sensitive to gammasecretase inhibitor (γSI) treatment followed by inhibitor washout [13] (Fig. S10d). Furthermore, we analyzed the genome-wide relationship between NOTCH1 and CTCF occupancy in T-ALL and found that both NOTCH1 and dynamic NOTCH1 sites are significantly enriched in chromatin domains containing T-ALL gained CTCF sites (Fig. 5b, S10e), although NOTCH1 and CTCF do not co-occupy the same loci (Fig. S10f). These T-ALL gained CTCF sites are also associated with increased levels of H3K27ac in T-ALL, indicative of potential enhancer function ( Fig. 5c,d). An example locus is shown in Fig. 5e.

CTCF and NOTCH1 require each other to activate their oncogenic targets in T-ALL
The significant association between T-ALL gained CTCF binding and dynamic NOTCH1 binding suggests that CTCF might cooperate with NOTCH1 to activate gene expression in T-ALL. To test for dependency of T-ALL gained CTCF binding on NOTCH1, we treated Jurkat cells with γSI for 72 hours to inhibit the release and nuclear translocation of the intracellular, transcriptionallyactive domain of NOTCH1, and then washed out the inhibitor to allow for recovery of intracellular NOTCH1 levels for 16 hours. CTCF ChIP-seq showed that γSI treatment abrogated CTCF binding at most T-ALL gained sites (Fig. 6a), and a portion of those γSI-sensitive binding events recovered upon γSI washout (Fig. S11a). Meanwhile, chromatin accessibility decreased at T-ALL gained CTCF sites with γSI treatment compared to DMSO, and significantly reversed after γSI washout (Fig. 6b). These results suggest that functional NOTCH1 binding is required for CTCF binding at T-ALL gained sites.
As NOTCH1 and CTCF do not physically interact with each other (Fig. 6c) and do not co-bind at the same genomic loci (Fig. S10f), we hypothesized that NOTCH1 may mediate the creation of an accessible chromatin configuration to allow for CTCF binding. Recent studies have shown that chromatin remodelers affect CTCF binding [52,53], and NOTCH1 can interact with the catalytic subunit of the mammalian SWI/SNF chromatin remodeling complex BRG1 (SMARCA4), as well as other members of the BAF and PBAF chromatin remodeling complexes [54]. We confirmed the NOTCH1-BRG1 interaction in our T-ALL cell lines (Fig. 6c), which indicates that NOTCH1 may induce chromatin remodeling. Interestingly, BRG1 in the AML cell lines EOL1 and MOLM13 has higher enrichment at AML gained CTCF sites than at constitutive CTCF sites (Fig. 6d, S11b) [55], although CTCF itself has lower binding levels at their respective gained sites in both AML and T-ALL than at constitutive sites ( Fig. 6e, S11c,d), suggesting that gained CTCF binding might need BRG1 to open chromatin. Thus, a potential mechanism by which NOTCH1 permits T-ALL gained CTCF binding could occur through BAF complex recruitment to open chromatin for CTCF binding.
The aforementioned findings suggest a potential role for T-ALL gained CTCF in oncogenic transcription mediated by NOTCH1. To test whether CTCF is required for NOTCH1's oncogenic transcription function, we knocked down CTCF with short hairpin RNAs (shRNA) in T-ALL cells (CUTLL1). Genes in the same chromatin domains containing T-ALL gained CTCF binding sites, especially those genes with higher expression in T-ALL compared to normal CD4 + T cells were significantly affected by CTCF silencing (Fig. 6f), indicating that T-ALL gained CTCF sites are the most disrupted in our silencing study. Interestingly, BART analysis revealed that the shCTCFdownregulated genes are most likely regulated by NOTCH1 (Fig. 6g). Thus, reducing CTCF levels may disrupt NOTCH1's ability to activate its target genes. Indeed, most NOTCH1 target genes in CUTLL1 are downregulated in shCTCF cells (Fig. 6h). Genes down-regulated in shCTCF cells are also significantly enriched for genes down-regulated in γSI-treated cells (Fig.  S11e), and reactivated after γSI wash-out (Fig. S11f). These data show that CTCF is required for NOTCH1 to regulate its target genes. Additionally, we found that genes located in chromatin domains containing both dynamic NOTCH1 and T-ALL gained CTCF sites are most up-regulated in T-ALL compared to normal CD4 + T cells (Fig. 6i). Of these T-ALL-upregulated genes, those located in chromatin domains with increased interaction between dynamic NOTCH1 sites and T-ALL gained CTCF sites are the ones whose expression is the most down-regulated upon CTCF silencing (Fig. S11g). Our collective findings suggest that NOTCH1 and CTCF cooperatively activate oncogenic transcriptional programs in T-ALL.

Discussion
Through integrative analysis of genomic data collected from the public domain, we presented a comprehensive CTCF binding repertoire in the human genome, from which we identified specific CTCF binding patterns in six distinct cancer types. We characterized a series of genomic and epigenomic features of cancer-specific CTCF binding events using multi-omics profiling techniques including WGS, TF and histone modification ChIP-seq, RNA-seq, ATACseq, RRBS, and in situ Hi-C. In contrast to previous studies that primarily focused on the effects of mutations or other modifications to CTCF itself or its binding sites [20,21,29,45,56,57], we identified unique CTCF binding patterns in specific cancer types that arise independently of mutations or DNA methylation changes at binding sites. Instead, cancer-specific CTCF recruitment likely results from other TFs that indirectly open chromatin and alter chromatin conformation. CTCF at these sites functions cooperatively with other TFs to facilitate enhancerpromoter interactions and to activate oncogenic transcription programs. In T-ALL, we identified such a cooperative program occurring between NOTCH1 and CTCF, in which NOTCH1 binding is required for gained CTCF binding in the same chromatin domain. This potentially occurs through NOTCH1-induced opening of chromatin at the CTCF binding sites. Gained CTCF binding then cooperates with NOTCH1 to activate transcription of its target genes. Interestingly, 13 we observed substantial enrichment of BRG1 at gained CTCF binding sites (Fig. 6d), as well as a direct protein-protein interaction between NOTCH1 and BRG1 (Fig. 6c). Although previous studies suggested that CTCF and BRG1 might physically interact 57 , we do not find this to be the case in T-ALL (Fig. S11h).
The dynamic interactions involving multiple factors and novel CTCF binding within a single chromatin domain may indicate the formation of phase-separated transcriptional condensates at super-enhancers [58][59][60]. In T-ALL, NOTCH1 binding drives the establishment of superenhancers [13]. Thus, T-ALL gained CTCF binding may be recruited by clusters of TFs and coactivators including chromatin remodeling complexes within phase-separated transcriptional condensates around super-enhancers. The potential for NOTCH1 as a master TF to direct the formation of 3D spatial clusters has been reported recently [61]. Transcriptional condensates maintain a highly active environment, which is consistent with the enrichment of H3K27ac observed near T-ALL gained CTCF sites (Fig. 5c). By inducing the frequency of chromatin contacts, gained CTCF binding may function to maintain the condensation state that helps drive transcription. A schematic model of the relationships between dynamic NOTCH1 binding, CTCF gain and activation of NOTCH target genes in T-ALL is shown in Fig. 7.
Our work in T-ALL found that gains in CTCF binding are located in distal enhancer regions, while cancer-specific CTCF binding loss events are enriched at gene promoter regions and correlate with repressed transcription of these promoters and decreased chromatin interactions.
Recently, an enhancer-docking mechanism described by Schuijers et al. [62] proposed that a single CTCF binding upstream of a promoter can function as a docking site for multiple distal enhancers; in this way, multiple enhancers loop to a single CTCF site to activate a single target gene promoter [62]. Loss of such a docking CTCF site then removes the ability to form these multiple enhancer loops, thus greatly reducing the ability to activate transcription. While our observations of cancer-specific lost CTCF sites are consistent with this "enhancer docking" model, further studies are required to understand the causal relationships between CTCF binding loss and gene repression.
Our study is built upon integrative computational analyses of multi-source public data coupled with our multi-omics experimental validations using T-ALL as a model system. As a pan-cancer study, our work is limited by data availability and quality. For example, coverage of RRBS and sequencing depths might lead to potential underestimation of differential DNA methylation.
Nevertheless, existence of gained CTCF binding independent of DNA methylation is validated.
Our findings pave the way for further mechanistic studies of causal relationships between CTCF binding alteration and oncogenic TF activities in leukemia as well as other cancers. Following our proposed model, oncogenic drivers can lead to novel CTCF binding at distinct enhancer regions in the genome, thus creating a signature pattern of CTCF binding. Having observed evidence supporting this model in T-ALL, we believe that studying aberrant CTCF binding events in other cancer types can further our understanding of the underlying oncogenic transcriptional regulatory networks specific to that cancer. In conclusion, unique aberrant CTCF binding pattern represents a novel epigenomic signature of cancer that can be independent of mutations and DNA methylation changes. Our work provides insights into a new angle of mechanistic research on cancer epigenomics.

In Situ Hi-C
In situ Hi-C was performed on CD4+ T cells, Jurkat, CUTLL1, and patient xenografts as previously described [5]. In brief, cells were crosslinked with 1% formaldehyde for 10 minutes at room temperature. Per Hi-C reaction, 5 million cells were lysed and nuclei were permeabilized.
DNA was digested with MboI from New England Biolabs (R0147M). Digested fragments were labeled with biotinylated d-ATP from Jena Bioscience (NU-835-BIO14-S) and ligated. After RNase treatment and Proteinase K treatment to reverse crosslinks, nuclei were sonicated using 16 a Covaris E220 to produce an average fragment length of 400 bp. Streptavidin beads from ThermoFisher Scientific (65001) were used to pull down biotin-labeled fragments. Following purification and isolation of DNA, final libraries were prepared using the NEBNext® Ultra™ II DNA Library Prep Kit for Illumina® and sequenced via paired end sequencing at a read length of 150 bp on an Illumina HiSeq 2500 to produce on average 400 million reads per sample.

ChIP-seq Profiling
CD4+ T cells, Jurkat, CUTLL1, and patient xenografts were crosslinked with 1% formaldehyde and 1% fetal bovine serum in PBS for 10 minutes at room temperature. The reaction was quenched with 0.2M glycine at room temperature for 5 minutes. Cells were then washed with PBS and pelleted.
For CTCF ChIPs, immunoprecipitation was performed based on a protocol described previously [64]. A pellet containing 50 million cells was lysed with 5mL of lysis buffer (50 mM HEPES-KOH, pH 7.5, 140 mM NaCl, 1 mM EDTA, 10% glycerol, 0.5% NP-40, 0.25% Triton X-100) for 10 minutes at 4°C. Nuclei were pelleted at 1350xg for 7 minutes and resuspended in 10 mM Tris pH 8, 1 mM EDTA and 0.1% SDS. Chromatin was sheared with a Covaris E220 system to an average fragment length of 400 bp and spun at 15,000 rpm for 10 minutes to remove insoluble chromatin and debris. The supernatant was incubated with 20 μL of Dynabeads Protein G for 30 minutes before discarding the beads. 1% of the total volume was saved as input and the rest was incubated with anti-CTCF antibody overnight. 100 μL of

PUBLIC DATA COLLECTION
Public CTCF ChIP-seq data were collected from Cistrome Data Browser [65] (for peak files) and NCBI GEO [66] (for fastq files, Table S1). Histone modification ChIP-seq data were collected from NCBI GEO and ENCODE [67] (for bam files). Public RNA-seq data in multiple cell types were collected from ENCODE (for fastq files). DNA methylation profiling data were collected from ENCODE (for bed bedMethyl files) and NCBI GEO. Hi-C data were collected from NCBI GEO and ENCODE (for fastq files). ATAC-seq data were collected from NCBI GEO (for fastq files). Whole Genome Sequencing data for BRCA, COAD, LUAD and PRAD samples were collected from International Cancer Genome Consortium (ICGC) Data Portal [47]. Survival data for BRCA, COAD and LUAD were downloaded from the TCGA Data Portal [68]. Detailed information including accession IDs of all public datasets collected in this work can be found in Table S4.
DESeq2 [79] was used to identify differentially expressed genes, and different thresholds used in different analysis were listed correspondingly in the manuscript.
Contact maps were generated at a resolution of 5kb. Raw matrix data were normalized using the approach described in Normalization of Chromatin Interactions.

Whole Genome Sequencing Data Analysis
Mutations were identified for two T-ALL cell lines (Jurkat and CUTLL1) and two T-ALL patient samples from the whole genome sequencing data. We aligned the Illumina short-read sequences to the human reference genome (GRCH38/hg38) using BWA [70] mem. We used SAMBlaster [83] to identify the discordant pairs, split reads and flag the putative PCR duplicates.
We used SAMBAMBA [84] to convert the SAM aligned into the BAM format, and samtools [71] was used to sort those aligned to create a BAM file corresponding to each sample.
We used VarDict [85] to identify the variants that overlapped the union CTCF binding sites. We used all the default parameters except '-f 0.1' which was used to identify variants that were supported by greater than 10% of the reads at that location. We annotated the variants using Variant Effect Predictor (VEP) [86], and used custom scripts to identify the variants that influence TF binding.
We again used VarDict [85] to identify the variants in the CTCF and NOTCH1 genes for the four samples. We used all the default parameters except '-f 0.1' which was used to identify variants that were supported by greater than 10% of the reads at that location. We annotated the variants using Variant Effect Predictor (VEP) [86], and then filtered it to identify the mutations that were either (a) not seen in more than 1% of any normal human population, or (b) had a CADD score of deleteriousness > 20, or (c) was present in the COSMIC database.

Identification of CTCF Binding Repertoire in the Human Genome
For CTCF ChIP-seq, we included a total of 793 datasets, including 787 public datasets and 6 datasets we generated (Table S1). 765 public CTCF ChIP-seq datasets with peaks more than 2,000 were further included in this study. Each dataset can yield MACS2-identified CTCF peaks in the range between 2,050 and 198,021, with a median of 46,451, and a total of 36,873,077 peaks (Fig. S1b). The distribution of the interval lengths between adjacent CTCF peak summits 24 of all 36,873,077 peaks from 771 datasets has an inflection point at ~150bp (Fig. S1c) indicating the boundary between the same binding site and different binding sites [87]. Therefore, we used 150 bps as the cutoff to merge CTCF peaks. In practice, we extended +/-75 bps from each peak summit to generate a 150bp region centered at the summit to represent each peak, and merged all overlapping peak regions to generate a union set of CTCF binding sites, which contains 688,429 non-overlapping sites. Each binding site was assigned a CTCF occupancy score, defined as the tally of ChIP-seq datasets that exhibit a peak within the site. Accordingly, we defined the occupancy frequency as the ratio of the occupancy score over the total number of CTCF ChIP-seq datasets. To further ensuring the robustness of the identified CTCF binding sites, we selected 285,467 high-confidence sites with occupancy score ≥3 for downstream analyses. CTCF motifs within the union binding sites were searched by FIMO [88] with Jaspar [89] matrix (ID: MA0139.1), with a p-value threshold of 1e-4. One motif with the smallest p-value was retained for each CTCF binding site.

Identification of Constitutive CTCF Binding Sites
The distribution of occupancy scores of all 285,467 CTCF binding sites (Fig. S1d, blue curve) shows that the majority of the CTCF binding sites occur in only a few datasets, and the number of binding sites decreases with increasing occupancy score when the occupancy score is small.
However, there are CTCF binding sites that are highly conserved across almost all datasets (e.g., binding sites with occupancy score greater than 600). We use a power law function to fit the distribution curve (blue) shown in Fig. S1d to determine the cutoff for constitutive CTCF sites. We denote ! " as the number of observed CTCF binding sites with occupancy score equal to #, and $ " as the number of expected CTCF sites with occupancy score equal to #. The power law fitting to data ! " can be described as (Fig. S1d, green): 34.56 where 0 is the occupancy score. We define the cutoff 7 for constitutive CTCF binding sites as: In other words, the total observed CTCF sites with occupancy score greater than A should be 6 times more than expected. We then determined A=615, and used an occupancy frequency cutoff of 80% to define 22,097 constitutive CTCF binding sites, which corresponds to the occupancy score ≥616 in all 771 CTCF ChIP-seq datasets.

Identification of Cancer Specific Gained/Lost CTCF Binding Sites
We used the following 2 criteria to identify cancer-specific lost CTCF binding sites: 1) The CTCF binding site should have a significantly lower occupancy frequency for datasets of that cancer type compared to the occupancy frequency for all datasets; and 2) CTCF binding level (quantified as normalized ChIP-seq read counts) at the site is significantly lower in cancer datasets than in other datasets. For gained CTCF sites, we used the vice versa set of criteria.
Briefly, for each CTCF binding site in each cancer type, the occupancy score in the cancer datasets were calculated along with its occupancy score in all 771 datasets. CTCF binding levels were obtained from a normalized read count matrix in which the ChIP-seq read counts (RPKM) were first calculated for union CTCF binding sites in all datasets and then followed by quantile normalization. We used unpaired two-tailed Student's t-test to quantify the difference of binding levels between different groups of datasets, and the p-value was then adjusted using the Benjamini-Hochberg Procedure [90]. In addition, binding occupancy scores and binding levels were compared between cancer datasets and datasets from the matched normal tissue or cell types, in order to take into account the potential confounding factor of tissue-specificity rather than cancer-specificity. Detailed criteria for identifying cancer specific CTCF binding sites are described below: • Cancer-specific lost CTCF binding sites: (1) occupancy frequency ≤0. The specific gained and lost CTCF binding sites for each cancer type are shown in Table S3.

Quantification of Differential Chromatin Accessibility
We used the processed data from Ref. [40] that include a matrix of normalized ATAC-seq insertion counts within the TCGA pan-cancer peak set to assess the differential chromatin accessibility around CTCF binding sites. For each cancer type among BRCA, CRC, LUAD and PRAD, the pan-cancer ATAC-seq peaks that overlap with identified cancer-type-specific lost or gained CTCF binding sites were used for downstream analyses. The ATAC-seq differential score for each peak was quantified as the fold change of the average of the normalized ATACseq insertion counts from patient samples in the corresponding cancer type versus from patients in other cancer types, and the ATAC-seq differential score was then assigned to the peak overlapped CTCF binding site.
For consistency, we applied the same approach used for TCGA ATAC-seq data to analyze the collected ATAC-seq data from T-ALL cell line Jurkat and normal CD4+ T cells. A data matrix 27 was generated using ATAC-seq raw read counts on union CTCF binding sites for all Jurkat and T cells datasets. Quantile normalization was applied on the log2 scaled matrix (pseudo count =5). The ATAC-seq differential score was measured as the fold change of the averaged normalized ATAC-seq counts between datasets of Jurkat versus CD4+ T cell at each CTCF binding site.

Survival Analysis
Survival analysis was applied on patient samples having both supported TCGA clinical data and ATAC-seq data [40]. For each cancer type among BRCA, CRC and LUAD, the patient samples are separated into two equal-sized groups based on the chromatin accessibility at identified cancer-specific lost or gained CTCF binding sites as follows: (1)  Kaplan-Meier (K-M) method was then used to create the survival plots and log-rank test was used to compare the differences of survival curves.

Normalization of Chromatin Interactions
Given a Hi-C contact map 7 = {A "B }, the score A "B reflects mapped reads between two genomic regions # and C. Suppose the bin size is 5kb, regions # and C will have a genomic distance of |# − C| * 5DE. Since the contact probability between two bins decreases with increasing genomic distance [91], we normalized the contact map as follows: for any given genomic distance F G = D * 5DE, we quantify a normalization factor H I J as the averaged interactions among all bin pairs with the same genomic distance F G in a same chromosome, e.g., H I

Detection of Differential Chromatin Interactions
We denoted the normalized Hi-C contact maps in the cancer dataset and the normal dataset as Student's t-test was then applied on TO and TQ to quantify the differential interaction between cancer and normal cells surrounding CTCF binding site 0.

Association of CTCF Binding with Gene Expression
54 cell types for which both CTCF ChIP-seq data and RNA-seq data are publicly available were selected (Table S4) for investigating the association between CTCF binding and gene expression for each CTCF-gene pair in the same chromosome. To obtain the CTCF binding level, a read count matrix was generated using reads per kilobase per million (RPKM) on union CTCF binding sites from ChIP-seq data. The read count matrix was scaled with square root of RPKM followed by quantile normalization. Gene expression level was measured for each gene 29 using the square root of transcripts per million (TPM) from RNA-seq data. For each CTCF-gene pair, we quantified the association between the CTCF site and the gene across all 54 cell types using the correlation coefficient R between the normalized CTCF binding level and gene expression (Fig. 2e). CTCF-gene pairs were deemed "highly correlated" with R 2 greater than 0.25.

Identification of Constitutive CTCF-Bounded Chromatin Domains
For each CTCF binding site, we defined its associated chromatin domain as the genomic region that: (1) includes this specific CTCF binding site; (2)

Detection of DNA Methylation Changes Surrounding CTCF Binding Sites
DNA methylation changes were detected within a 300bp region centered at each CTCF binding site. Regions with at least 3 CpGs covered by at least 5 reads (≥5x) in both cancer cell lines and matched normal tissues were retained. A 300bp region was detected as differentially methylated if the averaged differential methylation levels of all CpGs (≥5x) within this region was greater than 20% [92].

Detection of Differential Mutation Rate and Motif Score
For each CTCF binding site, the mutation rate was calculated as the occurrence of mutation events against the number of samples/patients at each single base pair within a 400bp region centered at the CTCF binding site. The mutation rate for a group of CTCF binding sites was considered as the averaged mutation rate for each base pair within the 400bp region.

30
Motif score was measured by scoring the CTCF position weight matrix (Jaspar [89], Matrix ID: A,C,G,T). The differential motif score was calculated by comparing motif scores for the reference and the mutated sequences.

Identification of CTCF Intra-Domain Differentially Interacted Regions
For a given set of CTCF binding sites, the chromatin interaction changes between a CTCF site and each of its intra-domain non-overlapped bins, measured from normalized Hi-C contact maps in cancer cells over matched normal cells, were collected for each of the CTCF binding sites (Fig. S11b). Regions with decreased interactions (log2 FC <-1, averaged log2 interaction >0) with cancer-specific lost CTCF binding sites, and regions with increased interactions (log2 FC >1, averaged log2 interaction >0) with cancer-specific gained CTCF binding sites were used for downstream transcription factor (TF) enrichment analysis.

Transcription Factor Enrichment Analysis
A revised version of the BART algorithm [51] was used for TF enrichment analysis. Briefly, a collection of union DNase I hypersensitive sites [93] (UDHS) was previously curated as a repertoire of all candidate cis-regulatory elements in the human genome, and 7032 ChIP-seq datasets were collected for 883 TFs [51], with each TF having one or more ChIP-seq datasets from multiple cell types or conditions. A binary profile was generated for each TF on UDHS indicating whether the TF has at least one peak from any of its ChIP-seq datasets locate within each of the UDHS. Binding enrichment analysis was applied for each TF by comparing the TF binding on a subset of UDHS overlapping the selected genomic regions versus the TF binding on UDHS. P value was obtained using two-tailed Fisher's exact test.

Availability of data and materials
The datasets generated in this study are available in NCBI GEO repository, under accession number GSE130140 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE130140). The public data used and analyzed in this study are summarized in Methods, with the accession information included in Tables S1, 2, and 4. All source code for analyzing the data and generating the results and figures in this paper are available at GitHub (https://github.com/zanglab/CTCF_T-ALL_code).

Competing interests
The authors declare that they have no competing interests.