Chromatin interaction neural network (ChINN): a machine learning-based method for predicting chromatin interactions from DNA sequences
Genome Biology volume 22, Article number: 226 (2021)
Chromatin interactions play important roles in regulating gene expression. However, the availability of genome-wide chromatin interaction data is limited. We develop a computational method, chromatin interaction neural network (ChINN), to predict chromatin interactions between open chromatin regions using only DNA sequences. ChINN predicts CTCF- and RNA polymerase II-associated and Hi-C chromatin interactions. ChINN shows good across-sample performances and captures various sequence features for chromatin interaction prediction. We apply ChINN to 6 chronic lymphocytic leukemia (CLL) patient samples and a published cohort of 84 CLL open chromatin samples. Our results demonstrate extensive heterogeneity in chromatin interactions among CLL patient samples.
Chromatin interactions play important roles in regulating gene expression [1,2,3]. They bridge enhancers to genes [4,5,6] and create insulated domains to constrain the reach of enhancers. High-throughput experimental techniques such as high-throughput chromosome conformation capture (Hi-C) and chromatin interaction analysis with paired-end tags (ChIA-PET) have been developed to detect genome-wide chromatin interactions. These techniques greatly advanced the understanding of genome organization and its roles in transcription regulation [4, 10,11,12]. However, due to costs and technical challenges, these methods have not been widely applied to large cohorts of cell lines or clinical samples. Hence, our understanding of how common or rare chromatin interactions are in different patient samples is limited.
A predictor that uses DNA sequences to predict chromatin interactions could potentially expand our understanding of genome organization. Sophisticated computational methods such as DeepSea and DeepBind have demonstrated that many transcription factor binding sites in open chromatin regions can be predicted from DNA sequences. Additionally, various computational methods have been developed to predict chromatin interactions to complement the experimental techniques [15,16,17,18,19,20,21]. Many of these methods rely on functional genomics data, such as chromatin immunoprecipitation sequencing (ChIP-seq) data of transcription factors and histone modifications, open chromatin data, and transcription data [15, 17, 19, 21].
Methods such as RIPPLE, TargetFinder, and JEME reported high performances in predicting enhancer-promoter interactions using supervised machine learning approaches. However, the reported performances were exaggerated by using cross-validation with random splitting of samples [22, 23]. The lack of effective machine learning approaches has motivated the field to develop new methods.
Recently, the convolutional neural network framework was adapted to predict Hi-C contact matrices from 1-dimensional sequence data in a method called “Akita”. There are other methods that predict Hi-C-like data and chromatin interactions, namely DeepTACT, SEPT, and DeepC. A detailed comparison and description of these methods is provided in Additional file 1: Table S1.
CTCF-associated genome folding patterns can be observed in the prediction results of Akita, suggesting the importance of CTCF in regulating chromatin interactions. In addition, the prediction results can recapture the differences in genome folding between normal and genetically altered cell lines, indicating that a machine learning framework can predict different genome folding profiles given different input DNA sequences.
However, there are several limitations to these methods. First, Akita and DeepC perform predictions only within limited sequence regions (in the case of Akita, 1 Mb); thus, long-range chromatin interactions cannot be predicted and genome-wide chromatin interactions cannot be obtained with these methods. Second, it is unclear whether ChIA-PET data can be predicted, as DeepTACT predicts promoter capture Hi-C data, and Akita, DeepC, and SEPT are restricted to Hi-C data. Third, none of these methods have been tested for their abilities to predict chromatin interactions de novo in patient cancer samples.
To overcome these challenges, in this study, we investigated the possibility of utilizing DNA sequence features to predict chromatin interactions between open chromatin regions, regardless of distance between them. Our study has several advantages. First, we demonstrated that open chromatin interactions can be predicted accurately from functional genomic data at the resolutions of the experimental techniques.
Second, we then developed a novel method, called chromatin interaction neural network (ChINN) to predict open chromatin interactions from DNA sequences. This model has been developed for RNA Polymerase II (RNA Pol II) ChIA-PET interactions, CTCF ChIA-PET interactions, and Hi-C interactions, overcoming previous limitations in terms of data input. Moreover, ChINN is able to identify open chromatin interactions in a genome-wide manner, overcoming the limitations of previous methods which were restricted to specific genomic regions.
Third, we extensively validated our method. ChINN was able to identify convergent CTCF motifs, motifs of AP-1 transcription factor family members such as FOS, and motifs of other transcription factors such as MYC as being important in predicting chromatin interactions. Moreover, we applied our model to a set of 6 newly generated chronic lymphocytic leukemia samples, which showed patient-specific chromatin interactions. We were able to validate predicted interactions by Hi-C and 4C. The models were then applied to a previously published cohort of 84 chronic lymphocytic leukemia (CLL) samples. Thus, we demonstrated the prediction power of our method in practice.
Fourth, we used ChINN to characterize the levels of open chromatin interaction heterogeneity in patient samples. While we found that many chromatin interactions are ubiquitous, we also found widespread evidence for patient-specific open chromatin interactions, and open chromatin interactions that were different in different subtypes of CLL.
Taken together, our results indicate both functional genomics models and ChINN can predict open chromatin interactions, and application of ChINN to cancer patient samples demonstrates widespread patient heterogeneity in chromatin interactions.
Open chromatin interactions can be predicted from functional genomic features
In light of Xi et al. and our previous study showing that the existing prediction methods have exaggerated performances, we first sought to demonstrate that chromatin interactions could be predicted from functional genomic data. Many previous studies focused on enhancer-promoter interactions that were annotated using chromatin interactions derived from Hi-C or ChIA-PET [15, 17, 19]. The enhancers used were typically hundreds of base pairs long, while the chromatin interaction anchors were much larger in size. For example, Hi-C anchors are normally 5 to 100 kb long (anchors as small as 1 kb are obtained only in rare cases with extremely deep sequencing) [29, 30], and ChIA-PET anchors are normally several kilobase pairs long [31, 32]. This resolution discrepancy could introduce considerable noise into the training datasets (Fig. 1a). Thus, we used the chromatin interaction anchors directly.
Positive samples were constructed from each ChIA-PET dataset separately, and the corresponding distance-matched negative datasets were generated (Additional file 1: Fig. S1). The resulting distance-matched datasets have positive-to-negative ratios of approximately 1:5, and all chromatin interactions were between open chromatin regions in the corresponding cell types (Additional file 1: Table S2). We used ChIP-seq data of transcription factors and histone modifications commonly available for GM12878, K562, and HelaS3, together with DNase-seq data from ENCODE, to annotate the anchors and build the feature vectors (Additional file 1: Table S3). For each chromatin interaction, the average signal of each transcription factor, histone modification, and open chromatin track was calculated for both anchors. The distance between the two anchors was also used as a feature.
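The feature construction described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the helper names `mean_signal` and `pair_feature_vector`, the per-base list representation of signal tracks, and the midpoint-to-midpoint definition of distance are all assumptions made for the example.

```python
def mean_signal(signal, start, end):
    """Average of a per-base signal track over the half-open interval [start, end)."""
    window = signal[start:end]
    return sum(window) / len(window) if window else 0.0


def pair_feature_vector(tracks, anchor_a, anchor_b):
    """Build one feature vector for a pair of anchors on the same chromosome.

    tracks: dict mapping a track name (e.g. a ChIP-seq target) to a per-base
            signal list; this representation is hypothetical.
    anchor_a, anchor_b: (start, end) tuples.
    Returns the mean signal of every track at both anchors, followed by the
    distance between anchor midpoints (an assumed definition of distance).
    """
    features = []
    for name in sorted(tracks):  # fixed order so feature columns are stable
        features.append(mean_signal(tracks[name], *anchor_a))
        features.append(mean_signal(tracks[name], *anchor_b))
    mid_a = (anchor_a[0] + anchor_a[1]) / 2
    mid_b = (anchor_b[0] + anchor_b[1]) / 2
    features.append(abs(mid_b - mid_a))
    return features


# Toy tracks over a 10-bp region (made-up values, for illustration only).
toy_tracks = {
    "CTCF": [0, 0, 2, 4, 0, 0, 6, 6, 0, 0],
    "DNase": [1] * 10,
}
toy_features = pair_feature_vector(toy_tracks, (2, 4), (6, 8))
```

Each pair therefore contributes two values per track plus one distance value, which is the shape of input a tree-based classifier would consume.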
Gradient boosted trees were used to build models for each dataset. We tested three feature sets: (1) all common functional genomics data and distance, (2) distance only, and (3) common functional genomics data only. A precision-recall (PR) curve shows the trade-off between precision and recall across different decision thresholds. The auPRC is calculated as the area under the PR curve. The models trained on all features achieved auPRCs ranging from 0.62 to 0.77 (Fig. 1b), while models trained on distance alone are mostly at baseline (Fig. 1d), showing that distance is properly controlled between positive and negative samples. The models trained on functional genomics features achieved auPRCs ranging from 0.58 to 0.69 (Fig. 1c), lower than models trained on all features. These results showed that although distance alone cannot predict chromatin interactions, combining distance with other features can help to distinguish positive from negative chromatin interactions, given the working mechanisms of the GB model (Additional file 1: Text S1).
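To make the metric concrete, the sketch below estimates the area under the PR curve as average precision in pure Python and computes the baseline expected from a random classifier, which equals the fraction of positive samples (about 1/6 at a 1:5 positive-to-negative ratio). The function names and the toy labels and scores are assumptions for illustration; this is not the evaluation code used in the study, and ties between scores are broken by input order.

```python
def average_precision(labels, scores):
    """Estimate auPRC as the mean precision at the rank of each positive sample."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive sample is required")
    # Rank samples from highest to lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            total += hits / rank  # precision when this positive is retrieved
    return total / n_pos


def random_baseline(labels):
    """Expected auPRC of a random classifier: the fraction of positives."""
    return sum(labels) / len(labels)


# Toy dataset with a 1:5 positive-to-negative ratio (made-up values).
toy_labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.04, 0.03]
toy_auprc = average_precision(toy_labels, toy_scores)
toy_baseline = random_baseline(toy_labels)
```

Comparing a model's auPRC against this baseline is what allows a statement such as "models trained on distance are mostly at baseline" to be read quantitatively.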
The across-sample performances were lower than within-sample performances (Fig. 1e). Using peak counts instead of signal values produced better across-sample performances but lower within-sample performances (Fig. 1f). Models trained on RNA Pol II datasets generalize well to each other. Models trained on CTCF ChIA-PET datasets, however, did not generalize well to each other. Models trained on CTCF ChIA-PET data perform poorly on RNA Pol II ChIA-PET datasets and vice versa.
Open chromatin interactions can be predicted from DNA sequences
In our previous section, we showed that open chromatin interactions can be predicted from functional genomics data, which consists of transcription factor data. As transcription factor binding can be predicted from sequences as shown by methods such as DeepSea (Zhou et al, Nature Methods, 2015) and DeepBind (Alipanahi et al, Nature Biotech, 2015), we reasoned that open chromatin interactions can be predicted from DNA sequences. Consequently, we went on to explore whether open chromatin interactions can be predicted from DNA sequences.
We built a convolutional neural network, ChINN, to predict chromatin interactions between open chromatin regions using DNA sequences (Fig. 2a). These models are called the ChINN sequence-based models. The models were trained on GM12878 CTCF, GM12878 RNA Pol II, HelaS3 CTCF, K562 RNA Pol II, and MCF-7 RNA Pol II datasets separately.
Compared to using functional genomics data for prediction, using sequences produced better within-sample performances for CTCF ChIA-PET datasets with auPRCs of 0.77 for GM12878 CTCF and 0.75 for HelaS3 CTCF (Fig. 2b), but worse within-sample performances for RNA Pol II ChIA-PET datasets with auPRCs of 0.51 for GM12878 RNA Pol II, 0.6 for K562 RNA Pol II, and 0.47 for MCF-7 RNA Pol II. Including distance as a feature in the classifier only slightly improved the performances for the distance-matched datasets (Fig. 2c). The CTCF models generalized well to each other in across-sample evaluation (Fig. 2d). RNA Pol II models can also generalize to each other. Models trained on CTCF ChIA-PET datasets perform poorly on RNA Pol II ChIA-PET datasets and vice versa (Fig. 2d, e). The inability to generalize between CTCF chromatin interactions and RNA Pol II chromatin interactions could be attributed to the different sequence contexts.
For each model, we obtained and matched the position-weight matrices for all kernels on the first convolutional layer to known transcription factor binding motifs (Additional file 1: Fig. S2). As expected, CTCF motif was captured by both CTCF models. Other than the CTCF motif, the remaining known transcription factor binding motifs learned by the two models were different, indicating the possible cell-type-specific motifs. Our findings of the cell-type-specific motifs were supported by other pieces of evidence: studies show that cell-type-specific CTCF-mediated interactions are important in gene regulation [35, 36] and CTCF binding sites vary extensively across cell types [37, 38]. The patterns learned by RNA Pol II models showed more diversity and no matching transcription factor binding motif was shared among the three models. Interestingly, some of the transcription factors identified, such as ZNF143 in K562 and GATA3 in MCF-7, play important roles in the relevant cancer types [39, 40].
We also trained ChINN models on GM12878, HeLaS3, HMEC, HUVEC, IMR90, K562, KBM7, and NHEK Hi-C data separately. The within-sample auPRCs using only sequences range from 0.52 to 0.77 for these eight Hi-C models (Fig. 2f). Including distance as a feature in the classifier only slightly improved the performances for the GM12878, HeLaS3, and NHEK Hi-C models (Fig. 2g). All eight Hi-C models generalized well to each other in across-sample evaluation (Fig. 2h, i).
Similarly, we obtained and matched the position-weight matrices for all kernels on the first convolutional layer to known transcription factor binding motifs for eight Hi-C datasets (Additional file 1: Table S4) and counted how many times each motif was detected (Additional file 1: Table S5). The CTCF motif was captured by all Hi-C models. The known transcription factor binding motifs learned by different Hi-C models were different. Some motifs, such as FOS, were learned by all models, but other motifs showed diversity, for example, ZN436 is detected by all other models except for HMEC, and ZIC3 is only detected by HeLaS3 (Additional file 1: Table S5). We noticed that the motifs detected in all cell lines exhibit smaller p-values than the cell-type-specific motifs, indicating that these “general” motifs are very important in predicting chromatin interactions. We speculate a model of chromatin interactions whereby there are general chromatin interactions facilitated by general transcription factors and common across different cell types, as well as cell-type specific chromatin interactions facilitated by cell-type specific transcription factors which can control cell-type specific transcription.
Convergent CTCF motifs are important for prediction of CTCF-associated open chromatin interactions
After extracting the sequence features from both the forward and reverse complement sequences of the anchors, the sequence features were fed into the classifier to obtain a probability score that indicated how likely the pair of anchors was to be involved in a chromatin interaction. We obtained the feature importance scores of the gradient boosted trees trained and validated using a set of extended datasets that includes more negative samples than the distance-matched datasets (Methods, Additional file 1: Fig. S3a-d). We noted that the PR curves of the models that used sequence features and distance (Additional file 1: Fig. S3a) were better than those of models using sequence features alone (Additional file 1: Fig. S3b). However, distance by itself was uninformative in predicting chromatin interactions (Additional file 1: Fig. S3c), suggesting that it is the combination of distance with sequence features that provides predictive power.
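As an illustration of how both strands of an anchor can be presented to a sequence model, the following sketch one-hot encodes a DNA sequence and its reverse complement. The channel order (A, C, G, T), the all-zero encoding of N, and the function names are assumptions for this example and are not taken from the ChINN implementation.

```python
# Assumed channel order: A, C, G, T; unknown bases (N) become all zeros.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
    "N": (0, 0, 0, 0),
}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def one_hot(seq):
    """Encode a DNA sequence as a list of 4-element rows, one per base."""
    return [list(ONE_HOT[base]) for base in seq.upper()]


def encode_anchor(seq):
    """Encode both strands of one anchor: (forward, reverse complement)."""
    return one_hot(seq), one_hot(reverse_complement(seq))


forward, reverse = encode_anchor("AACG")
```

Keeping the two strand encodings separate, rather than pooling them, is what makes strand-specific feature importance (and hence motif orientation) observable downstream.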
The models trained on sequence and distance features were able to predict chromatin interactions across different cell types (Additional file 1: Fig. S3d). Consequently, we focused on the sequence features that were important for the prediction. As convergent CTCF motifs have been observed in the anchor regions of CTCF loops [41,42,43], other sequence features or binding motifs at CTCF ChIA-PET anchors may also have such a convergent orientation. Interestingly, in CTCF models the important sequence features were on different strands of the two anchors in a convergent manner (Fig. 3a, Additional file 1: Fig. S3e), while RNA Pol II models did not show such a pattern (Fig. 3b, Additional file 1: Fig. S3f-g). For the CTCF models, importance scores of features on different strands of the two anchors showed good correlation, while importance scores of features on the same strand of the two anchors did not show much correlation (Fig. 3c). In contrast, the importance scores of features of RNA Pol II models were generally highly correlated regardless of the strand. These results are consistent with the previously observed convergent CTCF motifs at CTCF ChIA-PET anchors and further suggest that other transcription factors also bind to CTCF loops in a similar manner.
The kernels on the last convolutional layer that generated the most important features in the extended CTCF models captured the CTCF motif (Additional file 1: Fig. S3h), suggesting that convergent CTCF motifs were important for the prediction of CTCF-associated chromatin interactions. However, using only CTCF motif information for the prediction of CTCF-associated open chromatin interactions could not recapitulate the performance achieved by the convolutional neural network (Additional file 1: Fig. S3i), indicating that CTCF was not the sole determining factor of chromatin interactions. We also showed the results when training with NN model (same NN structure as ChINN sequence-based models) on the same datasets using sequence and distance feature to illustrate the superiority of GB model here (Additional file 1: Fig. S3j).
Similarly, we trained gradient boosted trees with the corresponding extended datasets for eight Hi-C datasets. Distance was the largest contributor (in terms of feature importance score) when it was used together with sequence features (Additional file 1: Fig. S4a-d). But on its own, it was not very informative. This suggests its interaction with the sequence features is informative. When we visualized the sequence feature importance, although not as obvious as that of the CTCF models, we observed that the important sequence features were on different strands of the two anchors according to the corresponding mean values (Fig. 3d, Additional file 1: Fig. S4e). However, the importance scores of features did not show high correlation on Hi-C datasets (Fig. 3e). All the extended Hi-C models captured the CTCF motif via the kernels of the most important feature on the last convolutional layer (Additional file 1: Fig. S4f), indicating that convergent CTCF motifs were important for the prediction of Hi-C data chromatin interactions. The results trained with NN model using sequence and distance feature were also shown for reference (Additional file 1: Fig. S4g).
Predicting chromatin interactions from open chromatin regions
The above models were trained and evaluated on known chromatin interactions. Without knowledge of chromatin interactions, as is the case for many clinical samples and cell types, the locations of the anchors would not be known. To be able to predict chromatin interactions between open chromatin regions, the models need to be able to predict chromatin interactions between paired genomic regions (anchors) of open chromatin regions.
We tested different combinations of merging distances and extension sizes (Fig. 4a) on validation datasets and selected a merging distance of 3000 bp and an extension size of 1000 bp for the construction of anchors in GM12878 cells (Additional file 1: Fig. S5a).
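A minimal sketch of anchor construction from open chromatin regions is given below. It assumes that regions on the same chromosome separated by a gap of at most the merging distance are merged first and that each merged region is then extended by the extension size on both sides; the exact order and boundary conventions used in the study are described in its Methods, and the function name `build_anchors` is hypothetical.

```python
def build_anchors(regions, merge_distance=3000, extension=1000):
    """Merge nearby open chromatin regions and extend them into anchors.

    regions: iterable of (chrom, start, end) tuples.
    Regions on the same chromosome whose gap is <= merge_distance are merged
    (assumed rule); each merged region is then widened by `extension` bp on
    both sides, clipped at coordinate 0.
    """
    merged = []
    for chrom, start, end in sorted(regions):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start - last[2] <= merge_distance:
            last[2] = max(last[2], end)  # grow the current merged region
        else:
            merged.append([chrom, start, end])
    return [(c, max(0, s - extension), e + extension) for c, s, e in merged]


# Toy open chromatin peaks (made-up coordinates).
toy_regions = [
    ("chr1", 10000, 10500),
    ("chr1", 12000, 12400),   # 1500 bp from the previous peak: merged
    ("chr1", 20000, 20300),   # 7600 bp away: kept separate
    ("chr2", 500, 900),
]
toy_anchors = build_anchors(toy_regions)
```

Note that under these assumptions two extended anchors can overlap when their merged regions lie between 3000 and 2000 + 3000 bp apart; whether such anchors should be merged again is a design choice this sketch leaves open.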
The pairs generated between anchors constructed from open chromatin regions in GM12878 were used to train gradient boosted trees for both CTCF and RNA Pol II models (see Methods). The positive-to-negative ratios were about 1:122 for CTCF chromatin interaction labeled samples and 1:186 for RNA Pol II chromatin interaction labeled samples. The CTCF model achieved within-sample auPRC of 0.514 and the RNA Pol II model achieved auPRC of 0.347 (Fig. 4b). In cross-sample evaluation, the CTCF model achieved auPRC of 0.359 on HelaS3 dataset and the RNA Pol II model achieved auPRCs of 0.232 and 0.164 on K562 and MCF-7 datasets, respectively (Fig. 4b). We were able to validate some of the predicted chromatin interactions in MCF-7 cells using 4C-seq (Additional file 1: Fig. S5b-d). Some of the validated chromatin interactions were not captured by the MCF-7 RNA Pol II ChIA-PET dataset, thus ChINN is able to identify bona fide chromatin interactions that might have been previously missed out due to insufficient sequence coverage.
We also generated pairs between anchors constructed from open chromatin regions in GM12878 and K562 Hi-C datasets with different combinations of merging distances and extension sizes (Additional file 1: Fig. S6a). We kept the same parameters as for the CTCF model, i.e., a merging distance of 3000 bp and an extension size of 1000 bp, to train gradient boosted trees, because the differences in auROC achieved with different parameters were insignificant. The GM12878 and K562 Hi-C models had relatively low auPRCs in the within-sample and cross-sample evaluations (Fig. 4c). However, the auPRC of our ChINN method was at least 4 times that of a random classifier. For the IMR90 cell line tested with the K562 model, the improvement was as high as 57-fold. As the datasets are large (for example, the IMR90 dataset has 979,699 samples), these improvements in auPRC could lead to many chromatin interactions being predicted correctly. Moreover, some of the chromatin interactions predicted by our Hi-C models in MCF-7 cells could be validated using 4C-seq (Additional file 1: Fig. S6b-d).
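The comparison with a random classifier can be made explicit with a short calculation. The sketch below assumes that the random baseline auPRC equals the fraction of positive samples and uses, as a worked example, the approximate 1:122 ratio and the within-sample auPRC of 0.514 reported above for the GM12878 CTCF open chromatin pairs. The function names are hypothetical, and the resulting fold value is illustrative arithmetic rather than a figure reported in the study.

```python
def random_auprc(n_positive, n_negative):
    """Expected auPRC of a random classifier: the fraction of positives."""
    return n_positive / (n_positive + n_negative)


def fold_improvement(auprc, n_positive, n_negative):
    """How many times larger a model's auPRC is than the random baseline."""
    return auprc / random_auprc(n_positive, n_negative)


# Approximate 1:122 positive-to-negative ratio, within-sample auPRC of 0.514.
ctcf_baseline = random_auprc(1, 122)
ctcf_fold = fold_improvement(0.514, 1, 122)
```

With a baseline this small, an auPRC that looks modest in absolute terms can still correspond to a large enrichment of true interactions among the top-scoring pairs.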
Other methods that predict Hi-C-like data and chromatin interactions are available, namely DeepTACT, SEPT, Akita, and DeepC. However, except for SEPT, the other three machine learning methods are very different from ChINN in terms of the data. For example, DeepTACT uses promoter capture Hi-C input data, which is quite different from the use of Hi-C and ChIA-PET input data. For Akita and DeepC, their output consists of Hi-C contact matrices on lists of user-specified genomic regions, while our output is chromatin loops and probabilities of interaction across the whole genome. As a consequence, we cannot call loops from these partial Hi-C matrices of Akita and DeepC, because we would not know the background genomic interaction distribution. Therefore, direct comparison between ChINN output and Akita/DeepC output is not possible.
As for SEPT, following its pipeline, we extended or cut the input sequences to 3 or 2 kb flanking regions from the center. However, SEPT performed worse than ChINN on our dataset (auPRC = 0.0016 evaluated on K562 Hi-C test datasets with HeLaS3 as source data), as the sequences in our dataset are longer than these input sequences. Cutting the sequences to 3 or 2 kb, as SEPT requires, would therefore prevent the model from learning much useful information.
We concluded that each method is designed to investigate different questions, and in Additional file 1: Fig. S7, we summarized the decision-making process for researchers who wish to use the different methods. For example, if the researcher is interested in promoter-promoter or promoter-enhancer interactions identified by PCHi-C (Promoter Capture Hi-C), they should use DeepTACT. If the researcher is interested in an output that is shown as a Hi-C heatmap, they should use Akita. If the researcher is interested in general chromatin interactions predicted from Hi-C data, or RNA Pol II and CTCF chromatin interactions predicted from ChIA-PET data, they should use ChINN. ChINN is the only machine learning method currently available for predicting Hi-C and ChIA-PET chromatin interactions from sequences with outputs specified as open chromatin associated chromatin interactions instead of Hi-C matrices.
Exploring chromatin interactions in patient samples
Next, we wished to apply our machine learning methods to patient samples to understand if our method could predict chromatin interactions in a completely new dataset. We obtained 6 chronic lymphocytic leukemia (CLL) patient samples. The clinical characteristics are described in Additional file 1: Table S6.
We prepared integrated Hi-C, ATAC-Seq, and RNA-Seq libraries from these 6 samples. We used Juicer to call topologically associated domains and loops from these patient samples. Our CLL samples showed many TADs and loops (Additional file 1: Table S7), thus indicating that we were able to perform Hi-C in these patient samples.
Next, we applied GM12878 and K562 Hi-C models to six new CLL samples. We used GM12878 and K562 Hi-C models for this prediction because CLL, GM12878, and K562 all come from hematopoietic lineages, and therefore GM12878 and K562 predictions would be likely to have captured both general chromatin interaction mechanisms and tissue-specific mechanisms that are relevant to hematopoietic cells.
The auPRCs achieved by the GM12878 Hi-C model ranged from 0.2772 to 0.4362, slightly higher than those of the K562 Hi-C model, which ranged from 0.2607 to 0.3996 (Fig. 5a). We calculated the F-score at different thresholds and selected a threshold of 0.025 for the GM12878 model and 0.016 for the K562 model to make predictions on the new CLL samples (Additional file 1: Fig. S8a-b); the corresponding confusion matrices are shown in Fig. 5b and c.
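Threshold selection by F-score can be sketched as follows. The example assumes the balanced F1 score (beta = 1) and a small grid of candidate thresholds; the text does not state which F-score variant or grid was used, and the function names and toy data are hypothetical.

```python
def f_score(labels, scores, threshold, beta=1.0):
    """F-beta score when samples with score >= threshold are called positive."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def best_threshold(labels, scores, candidates):
    """Return the candidate threshold with the highest F-score."""
    return max(candidates, key=lambda t: f_score(labels, scores, t))


# Toy predictions (made-up values) and an assumed grid of thresholds.
toy_labels = [1, 1, 0, 0, 0, 1]
toy_scores = [0.30, 0.05, 0.04, 0.01, 0.02, 0.03]
toy_grid = [0.01, 0.025, 0.05, 0.1]
toy_best = best_threshold(toy_labels, toy_scores, toy_grid)
```

Because positives are rare in open chromatin pair datasets, the predicted probabilities are small, which is why the selected thresholds can be far below 0.5.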
One question we asked was whether there is patient heterogeneity in Hi-C data. We use “heterogeneity” to indicate that the chromatin interactions are different across patients. Clinical samples differ from each other due to a wide variety of factors including different driver mutations and different underlying genetics and epigenetics of each patient.
Here, we asked whether the subtype of the CLL samples could be one factor giving rise to patient heterogeneity. The CLL samples could be divided into two subtypes based on IGHV mutation status. In our data, two samples (102 and 344) are of the IGHV-unmutated CLL (uCLL) type and four samples (312, 324, 401, and 484) are of the IGHV-mutated CLL (mCLL) type. IGHV mutation status is an important prognostic biomarker in CLL, with mCLL being less aggressive.
Genomic sequences are almost identical across different patient samples, except for regions of patient-specific cancer structural variations and single nucleotide variations. Thus, if two anchors are identical in a different cell type, the probability that they are interacting given by the model will be the same. In this study, we have limited the scope of chromatin interaction prediction to only open chromatin regions, and call the predicted chromatin interactions “open chromatin interactions”.
Open chromatin profiles have been used to cluster cell types and cancer subtypes (Rendeiro et al, Nature Commun, 2016). With the assumption that the mechanisms of chromatin interactions are similar between different patient samples with the same cancer and with the patient-specific open chromatin regions, we explored the different chromatin interactions arising due to open chromatin differences between patient samples.
As a first step to investigate this question, we applied our ChINN framework on the six new CLL samples and built models using Hi-C and ATAC-seq data from each CLL sample. Models built using CLL samples would have captured general chromatin interaction mechanisms and tissue-specific mechanisms relevant to hematopoietic cells, as well as CLL-specific mechanisms.
Figure 5d showed the predicted chromatin interactions in 6 new CLL samples and the differences between uCLL and mCLL samples. With the selected threshold, a total of 152,202 Hi-C-associated open chromatin interactions were predicted (Fig. 5d) by GM12878 Hi-C model. We found extensive patient heterogeneity (Fig. 5e, f), as observed from the lack of similarity of chromatin interactions across the new CLL samples and the overlapping peaks between new CLL samples and GM12878 Hi-C peaks. For example, Fig. 5e indicates that 37% of Hi-C identified chromatin interactions can be found in only one sample, but not the other five samples. This indicates that many open chromatin interactions can only be found in one sample, and is an illustration of the level of heterogeneity in terms of the presence and absence of open chromatin regions and their associated chromatin interactions.
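The kind of uniqueness analysis described here can be illustrated with a small tally of how many samples each interaction appears in. The sketch assumes interactions are matched across samples by an exact identifier; in practice anchors would likely need overlap-based matching, and the function name and toy identifiers are hypothetical.

```python
from collections import Counter


def sample_count_distribution(interactions_by_sample):
    """Fraction of interactions found in exactly k samples, for each k.

    interactions_by_sample: dict mapping a sample name to a collection of
    hashable interaction identifiers (exact matching is an assumption).
    """
    occurrences = Counter()
    for interactions in interactions_by_sample.values():
        occurrences.update(set(interactions))  # count each sample once
    n_total = len(occurrences)
    if n_total == 0:
        return {}
    by_k = Counter(occurrences.values())
    return {k: by_k[k] / n_total for k in sorted(by_k)}


# Toy data: three samples and four distinct interactions (made-up).
toy_samples = {
    "sample_A": {"i1", "i2", "i3"},
    "sample_B": {"i1", "i2"},
    "sample_C": {"i1", "i4"},
}
toy_distribution = sample_count_distribution(toy_samples)
```

In this toy case half of the interactions occur in a single sample; the entry for k = 1 is the quantity that a statement such as "found in only one sample" refers to.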
In addition, we also applied our ChINN framework on the six new CLL samples and built models using Hi-C and ATAC-seq data from each CLL sample. Our Hi-C libraries identified 1795 open chromatin interactions unique to uCLL samples and 10663 open chromatin interactions unique to mCLL samples (Fig. 6a). Moreover, uniqueness analysis of the Hi-C interactions from these six CLL samples similarly showed high patient heterogeneity (Fig. 6b). Thus, both predicted open chromatin interactions and Hi-C identified interactions indicate high patient heterogeneity.
These models have auPRCs ranging from 0.37 to 0.58 (Fig. 6c). In addition, across-sample testing of these CLL models on datasets from other CLL samples suggests comparable performance (Fig. 6d). Inclusion of distance did not result in a dramatic increase in model performance (Additional file 1: Fig. S9a-9b). Moreover, the first convolutional layers of all CLL models were able to capture the binding motifs of CTCF and AP-1 transcription factor family members (FOS, JUN, JUNB, JUND) (Additional file 1: Fig. S9c), similar to the Hi-C models we showed earlier (Additional file 1: Fig. S4e; Additional file 1: Table S4-5).
After that, we trained gradient boosted trees with the corresponding extended datasets of the CLL samples. We observed a similar correlation of the important sequence features on different strands of the two anchors (Fig. 6e; Additional file 1: Fig. S9d-9e), although the within-sample and cross-sample auPRCs decreased (Additional file 1: Fig. S9f-9g).
We also generated open chromatin pairs using ATAC-seq to train the gradient boosted trees (merging size, 3000 bp; extension size, 1000 bp). Although the performances decreased compared with using Hi-C anchor region pairs as input, they were still higher than the random auPRC values (Additional file 1: Fig. S9h-9k). We further used the 401 CLL sample model to predict open chromatin interactions in MCF7 cells, as the 401 CLL model has the highest within-sample and across-sample performance. The predicted interactions correlate quite well with the real 4C-seq interactions (Fig. 6f, g, Additional file 1: Fig. S9l-9o, threshold = 0.016).
To examine patient heterogeneity in the Hi-C data further, we first tried to associate the real and predicted Hi-C interactions with differentially expressed genes identified from RNA-seq data. The results showed that although a trend of different IFC scores (the fold change of the average number of open chromatin interactions observed at the gene promoter in uCLL samples over that in mCLL samples) could be observed, these differences were not significant (Additional file 1: Fig. S9p-9q). We also observed that the Hi-C interactions and ATAC-seq peaks in the new CLL samples showed high patient heterogeneity (Additional file 1: Fig. S9r). This patient heterogeneity, together with the limited sample size after we collapsed all six samples into mutated and unmutated categories, may explain the lack of significance in the IFC score analysis (Additional file 1: Fig. S9p-9q).
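The IFC score defined above can be written as a short calculation. The sketch below follows the stated definition (mean interaction count at a gene promoter in uCLL samples divided by the mean in mCLL samples) and uses the sample grouping given earlier in the text; the per-sample counts are made up, and the optional `pseudocount` parameter is an assumption added to handle zero counts, not something described in the source.

```python
UCLL_SAMPLES = ["102", "344"]                 # IGHV-unmutated samples
MCLL_SAMPLES = ["312", "324", "401", "484"]   # IGHV-mutated samples


def ifc_score(counts_by_sample, ucll=UCLL_SAMPLES, mcll=MCLL_SAMPLES,
              pseudocount=0.0):
    """Fold change of mean promoter interaction counts, uCLL over mCLL.

    counts_by_sample: dict mapping sample name to the number of open
    chromatin interactions observed at one gene promoter.
    pseudocount: assumed optional smoothing term; with the default of 0 a
    ZeroDivisionError is raised when the mCLL mean is zero.
    """
    u_mean = sum(counts_by_sample[s] for s in ucll) / len(ucll)
    m_mean = sum(counts_by_sample[s] for s in mcll) / len(mcll)
    return (u_mean + pseudocount) / (m_mean + pseudocount)


# Hypothetical interaction counts at one promoter (illustration only).
toy_counts = {"102": 6, "344": 4, "312": 2, "324": 3, "401": 1, "484": 2}
toy_ifc = ifc_score(toy_counts)
```

With only two uCLL and four mCLL samples, each group mean rests on very few values, which is consistent with the difficulty of reaching statistical significance noted above.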
Taken together, our results demonstrate across-sample prediction capability for the ChINN model. In addition, we observed high patient heterogeneity in the new CLL samples from the predicted open chromatin interactions as well as the Hi-C identified chromatin interactions.
Exploring open chromatin interactions in a cohort of patient samples
Next, we used our machine learning method to predict open chromatin interactions in a cohort of patient samples and then analyzed the data. We applied the above models to 84 chronic lymphocytic leukemia (CLL) samples whose open chromatin profiles were available from ATAC-seq. Among the 84 CLL samples, 34 are of the uCLL type and 50 are of the mCLL type.
A total of 48,443 CTCF-associated open chromatin interactions and 23,633 RNA Pol II-associated open chromatin interactions were predicted based on the pooled open chromatin regions of all samples (Fig. 7a). RNA Pol II-associated chromatin interactions were better conserved across the CLL samples than CTCF-associated chromatin interactions (Fig. 7b), which could be attributed to that open chromatin regions in the CLL samples that overlapped with GM12878 RNA Pol II peaks were better conserved than those overlapping with GM12878 CTCF peaks (Fig. 7c). Using this set of ATAC-seq data in CLL samples, it was reported that regions with higher open chromatin signals in uCLL samples showed strong enrichment of binding sites of CTCF, RAD21 and SMC3 , which could also contribute to the high variability of CTCF chromatin interactions. Moreover, we again observed extensive patient heterogeneity of CTCF and RNA Pol II-associated predicted open chromatin interactions in these clinical samples.
When applying the GM12878 Hi-C model to the CLL samples, a total of 758,407 Hi-C-associated open chromatin interactions were predicted (Fig. 8a). The phenomenon observed from the CTCF model also can be observed from the Hi-C model, for example, the chromatin interactions across the CLL samples and the overlapping peaks between CLL samples and GM12878 Hi-C peaks were not well conserved as that of RNA Pol II (Fig. 8b, c). The predicted open chromatin interactions by Hi-C model were able to separate mCLL and uCLL samples (Additional file 1: Fig. S10a). Most differential chromatin interactions were associated with changes in the occurrence of one anchor (Fig. 8d). Genes that were upregulated in uCLL were associated with uCLL-specific chromatin interactions (Fig. 8e). In the set of differential chromatin interactions whose anchors did not have the same level of changes as the chromatin interactions themselves between the two subtypes, the rate of co-occurrences of the two anchors within the same sample and the levels in chromatin interactions could change (Additional file 1: Fig. S10b). Examples of predicted open chromatin interactions are shown in Fig. 8f and g and Additional file 1: Fig. S10e-h. Thus, we observed extensive patient heterogeneity of Hi-C predicted open chromatin interactions in these clinical samples.
Using the predicted open chromatin interactions, it was possible to separate mCLL and uCLL samples (Additional file 1: Fig. S11a). Variations in occurrences of open chromatin interactions between the two subtypes of CLL were associated with variations in occurrences of anchor regions. Most differential ChIA-PET chromatin interactions were associated with changes in the occurrence of one anchor (Fig. 7d). There was a small portion of differential chromatin interactions whose anchors did not have the same level of changes as the chromatin interactions themselves between the two subtypes. In this set of differential chromatin interactions, the rate of co-occurrences of the two anchors within the same sample could change, contributing to the levels of changes in predicted open chromatin interactions (Additional file 1: Fig. S11b). With the GM12878 Hi-C model, we were also able to see differences in connectivity at transcription start sites associated with differences in the occurrences of the open chromatin regions at the transcription start sites (Additional file 1: Fig. S10d).
Genes with higher expression in uCLL showed higher connectivity at the transcription start sites (Fig. 7e, Additional file 1: Fig. S11c, Fig. 6e, Additional file 1: Fig. S10c). The differences in connectivity at transcription start sites were associated with differences in the occurrences of the open chromatin regions at the transcription start sites between CLL subtypes (Additional file 1: Fig. S10d and Additional file 1: Fig. S11d), and also, differences in connectivity were sometimes associated with differences in distal interacting regions (Additional file 1: Fig. S11e, Fig. 7f). Examples of predicted open chromatin interactions are shown at important CLL prognostic markers, such as LPL (Fig. 7g), ZAP70 (Additional file 1: Fig. S11f), ZNF667 (Additional file 1: Fig. S11g), and CD38 (Additional file 1: Fig. S11h) [45,46,47,48]. Taken together, our results indicate that different subtypes show different profiles of predicted open chromatin interactions. Different subtypes may be a source of patient heterogeneity in clinical samples.
In this manuscript, we described two methods of predicting chromatin interactions: first, a functional genomics approach that uses local epigenomics data to accurately predict chromatin interactions; and second, a convolutional neural network, ChINN, which can extract sequence features and be coupled to classifiers to predict chromatin interactions between open chromatin regions using DNA sequences and distance.
We showed that, at resolutions limited by the experimental techniques, chromatin interactions between open chromatin regions could be predicted from 1-dimensional functional genomics data, as shown by the ability of the cross-sample model to capture the chromatin interactions. ChINN only requires the use of open chromatin data and showed good generalizability on the same type of chromatin interactions across different cell types. Thus, it has the potential to be applied to large sets of clinical samples with limited biological materials. In addition, ChINN can discover sequence features that are important for predicting chromatin interactions, including shared features such as the CTCF motif and cell-type specific features such as the binding motif of GATA3, a gene frequently mutated in breast cancer, in MCF-7. We could also validate ChINN-identified chromatin interactions by 4C.
In distance-controlled experiments, our prediction method using functional genomics data performed better on RNA Pol II chromatin interactions but worse on CTCF chromatin interactions compared to sequence-based ChINN. Such differences could be attributed to the lower functional genomic complexity at CTCF binding sites and to the possibility that functional genomic data fail to capture the convergent CTCF binding motifs often observed at CTCF-mediated chromatin interactions.
We also noticed that the models trained using sequence features of CTCF ChIA-PET data performed better in cross-sample prediction than the models trained using functional genomics data from CTCF ChIA-PET. We reason that this difference in performance may be explained by the different resolution of the data. ChIP-seq peaks can be several hundred bp long (and they are further generalized into count data when preparing the input features), while the CTCF motif is less than 20 bp long (CTCF motif MA0139.1 from the JASPAR database). In addition, ChIP-seq peaks cannot reveal the orientation of CTCF binding, whereas the sequence reveals the direction of the CTCF motif. As CTCF orientation is important in chromatin interactions, sequence features give more information about both the binding site and the binding orientation. Therefore, the CTCF information available to the two cross-sample prediction approaches is different.
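To illustrate why sequence carries orientation information that peak counts do not, the sketch below checks on which strand a motif occurs in each anchor and whether a pair is convergent (forward-strand motif in the upstream anchor, reverse-strand motif in the downstream anchor). It is a toy: exact string matching with a short placeholder motif stands in for scanning with a position weight matrix such as MA0139.1 (for example with FIMO), and all function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def motif_strands(seq, motif):
    """Return the set of strands ('+', '-') on which the exact motif string occurs."""
    seq, motif = seq.upper(), motif.upper()
    strands = set()
    if motif in seq:
        strands.add("+")
    if reverse_complement(motif) in seq:
        strands.add("-")
    return strands


def is_convergent(upstream_seq, downstream_seq, motif):
    """True if the upstream anchor has the motif on '+' and the downstream anchor on '-'."""
    return "+" in motif_strands(upstream_seq, motif) and "-" in motif_strands(downstream_seq, motif)


# Placeholder motif and made-up anchor sequences, for illustration only.
TOY_MOTIF = "CCCTC"
upstream_anchor = "AACCCTCAA"    # motif on the forward strand
downstream_anchor = "TTGAGGGTT"  # motif on the reverse strand
```

Both anchors would count as "one CTCF site" in ChIP-seq-derived count features, yet swapping them turns a convergent pair into a divergent one, a distinction only the sequence preserves.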
On the other hand, RNA Pol II binding sites do not have such distinctive DNA motifs, making it harder to predict RNA Pol II binding sites [13, 14] and consequently harder to predict RNA Pol II-associated chromatin interactions from DNA sequences. However, RNA Pol II binding sites are usually occupied by many other transcription factors, making it easier to predict RNA Pol II-associated chromatin interactions using functional genomic data.
The application of ChINN models with gradient boosted tree classifiers to a set of CLL ATAC-seq samples showed that several of the predicted open chromatin interactions could be validated by Hi-C. We note that the auPRC scores of ChINN, particularly for the GM12878 and K562 models applied to explore chromatin interactions in patient samples, were around 0.26–0.6, which is consistent with cross-sample testing of other epigenomics machine learning prediction methods such as DeepHistone. These auPRC scores are not very high, which could be due to several reasons.
First, auPRC tends to be low when negative samples greatly outnumber positive samples. The number of chromatin interactions in the genome is small relative to the number of genomic regions with no chromatin interactions reported by Hi-C, so the negative samples in our data greatly outnumber the positive samples. As such, it is expected that the auPRC score will not be very high when applying GM12878 and K562 Hi-C models to other cell lines or patient samples.
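The dependence of auPRC on class balance can be made concrete with a small sketch. The function below computes auPRC as average precision in plain Python (ties in scores are not treated specially, which is a simplification), and the example constructs an uninformative ranking in which positives are spread evenly among negatives. The 1% positive fraction and all names are illustrative assumptions, not values from our datasets.

```python
def average_precision(labels, scores):
    """Area under the precision-recall curve, computed as average precision.

    labels: iterable of 0/1 ground-truth labels
    scores: iterable of predicted scores (higher means more likely positive)
    """
    labels = list(labels)
    n_positive = sum(labels)
    if n_positive == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at the rank of each positive
    return total / n_positive


# Uninformative ranking: 1000 pairs, one positive at every 100th rank (1% positives).
n_pairs = 1000
uninformative_labels = [1 if (i + 1) % 100 == 0 else 0 for i in range(n_pairs)]
descending_scores = [n_pairs - i for i in range(n_pairs)]
baseline = average_precision(uninformative_labels, descending_scores)
```

In this construction the precision at every positive is 0.01, so `baseline` matches the positive fraction. An auPRC should therefore be read against the positive fraction of the test set rather than against 0.5.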
Second, the ChINN method only takes as input the sequences of DNA at open chromatin regions of the genome for prediction. If more types of data are input into the model, the performance of the model is likely to improve, but at the cost of requiring more datatypes which are expensive and labor-intensive to acquire.
Third, while there were also chromatin interactions that were predicted but not validated by Hi-C, our results showing that 4C could validate predicted chromatin interactions in MCF-7 cells that were not identified by Hi-C suggest that these so-called “false positives” might potentially be real chromatin interactions that were simply not captured by Hi-C due to limited sequencing depth of Hi-C libraries.
In future work, further development of Hi-C and other chromatin interaction sequencing methods to comprehensively capture chromatin interactions will allow for a better comparison with chromatin interaction predictions. Additionally, further development and refinement of ChINN to improve the accuracy of chromatin interaction prediction is warranted.
Application of ChINN models in CLL revealed that although there were open chromatin interactions that were ubiquitous in all samples, there were a large number of patient-specific open chromatin interactions and also chromatin interactions that were found in fewer than half the samples. We note that chromatin interactions predicted using cross-sample models are likely to show less cell-type specificity, and the fact that sample heterogeneity can be seen in these predicted open chromatin interactions in spite of the lower likelihood of cell-type specificity due to the nature of the chromatin interaction prediction, suggests that chromatin interaction heterogeneity is widespread throughout the genome. Moreover, the observation of predicted open chromatin interaction heterogeneity agrees with our observations that there exist both ubiquitous chromatin interactions and patient-specific chromatin interactions in the 6 Hi-C libraries from the 6 CLL patient samples that we examined. While we previously observed patient-specific chromatin interactions at particular loci , here, we show that this phenomenon is widespread. To the best of our knowledge, this observation of widespread nature of patient-specific chromatin interactions is novel and has not been previously reported in the 3D genome organization field.
One potential reason for these different chromatin interactions could be due to different patient subtypes. Importantly, we found systematic differences in chromatin interactions involving important CLL prognostic genes, such as LPL and CD38, between the IGHV-mutated and IGHV-unmutated subtypes. These results suggest that differences in chromatin interaction landscapes between CLL subtypes could have important functional implications in CLL biology. Moreover, differences in chromatin interaction presence or absence may lead to different expression of oncogenes in cancers.
Our observation of widespread patient heterogeneity in patient cancer samples highlights the need for precision medicine and the need to understand chromatin interactions in individual patient samples. Machine learning offers one way for us to predict chromatin interactions in a cost-effective manner. The ChINN method may be useful in the future in understanding chromatin interactions in large cohorts of clinical samples and identifying chromatin interaction-based biomarkers that can be used to distinguish between different subtypes of cancer which may help in the development of precise therapies for the different subtypes of cancer.
A functional genomics approach is able to predict chromatin interactions. The ChINN framework is able to predict chromatin interactions from open chromatin regions in the human genome, using DNA sequences and distances as features. This framework can be applied in other cell lines or clinical samples given the knowledge of open chromatin regions, making it a useful tool to interrogate chromatin interactions when large-scale functional genomics acquisition is not applicable due to limited biological materials.
We performed machine learning, Hi-C interaction analysis, ATAC-seq, RNA-seq, and gene expression analyses as described in the following sections. The quality information of generated Hi-C, ATAC-seq, and RNA-seq libraries can be found in Additional file 1: Table S7, Additional file 1: Table S8, and Additional file 1: Fig. S12.
Machine learning of ChIA-PET data
The development of the sequence models was divided into three stages. In the first stage, the distance-matched datasets were used to train models consisting of a convolutional neural network (feature extractor) with fully connected layers as the classifier, as shown in Fig. 2a. The first-stage deep learning model works as a feature extractor that converts raw sequence into a numerical representation that can be used as input to the machine learning models. Stages 2 and 3 train the machine learning models that make the predictions. In the second and third stages, the feature extractors trained in the first stage were frozen and gradient tree boosting classifiers were used as classifiers. In the second stage, the gradient tree boosting classifiers were trained using the extended datasets. In the third stage, the gradient tree boosting classifiers were trained using all potential pairs of anchors generated from open chromatin data and annotated by existing ChIA-PET data. Thus, the final result was a program that took in a list of open chromatin regions and produced predictions of chromatin interactions between the open chromatin regions.
The feature extractors took DNA sequences of both anchors of a potential interacting pair as input. The classifier then took the features generated by the feature extractor and optionally the distance between anchors as input and produced a probability score of interaction. This final model was defined as the “from open chromatin” model. More details can be found in Additional file 1: Supplementary Methods [52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69].
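The construction of third-stage training pairs can be sketched as follows: enumerate candidate pairs of open chromatin regions on the same chromosome and label a pair positive when its two anchors overlap the two anchors of a known interaction. This is a minimal illustration, not the released implementation; the distance cutoff, the midpoint-based distance, the overlap rule, and every name here are assumptions.

```python
from itertools import combinations


def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def candidate_pairs(regions, max_distance):
    """Yield same-chromosome region pairs whose midpoints lie within max_distance."""
    for left, right in combinations(sorted(regions), 2):
        if left[0] != right[0]:
            continue
        distance = abs((right[1] + right[2]) // 2 - (left[1] + left[2]) // 2)
        if distance <= max_distance:
            yield left, right, distance


def label_pairs(regions, known_interactions, max_distance):
    """Label each candidate pair 1 if both anchors overlap one known interaction, else 0."""
    rows = []
    for left, right, distance in candidate_pairs(regions, max_distance):
        positive = any(
            (overlaps(left, a) and overlaps(right, b))
            or (overlaps(left, b) and overlaps(right, a))
            for a, b in known_interactions
        )
        rows.append((left, right, distance, int(positive)))
    return rows


# Made-up open chromatin regions and one made-up known interaction.
open_regions = [
    ("chr1", 100, 200), ("chr1", 5000, 5100), ("chr1", 20000, 20100),
    ("chr1", 900000, 900100), ("chr2", 100, 200),
]
known = [(("chr1", 150, 250), ("chr1", 5050, 5200))]
rows = label_pairs(open_regions, known, max_distance=100_000)
```

Each row would then be passed through the frozen feature extractor, and the resulting features, optionally with the distance, would be given to the gradient tree boosting classifier.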
Machine learning of Hi-C data from cell lines
We collected the Hi-C interactions from 8 cell lines: GM12878, HeLaS3, HMEC, HUVEC, IMR90, K562, KBM7, and NHEK. The machine learning models for Hi-C data from cell lines were constructed following the same procedure as described for ChIA-PET data, with the positive data annotated according to the Hi-C interactions.
Machine learning of Hi-C data from clinical samples
We collected the Hi-C interactions from 6 CLL clinical samples: CLL 102, CLL 312, CLL 324, CLL 344, CLL 401, and CLL 484. The machine learning models for Hi-C data from clinical samples were constructed following the same procedure as described for ChIA-PET data, with the positive data annotated according to the Hi-C interactions. The CLL 401 model was used in the across-sample prediction.
Preparation of clinical samples
Chronic lymphocytic leukemia patient samples (either peripheral blood or bone marrow isolates) were obtained from the Leukemia Cell Bank at the National University Health System (NUHS) with patient consent, under Institute Review Board number H-20-022E. The CLL samples were either bone marrow aspirates (312, 324, 344, 484, and 102) or peripheral blood (401). The samples were immediately frozen after collection and stored in liquid nitrogen until further use.
The samples were taken out of the liquid nitrogen and thawed by dipping in a beaker containing water at 37 °C. Once a sample was thawed completely, the cells were immediately transferred to a 15 ml Falcon tube and resuspended in 10 ml PBS containing 2% fetal bovine serum (FBS) and 2 mM EDTA. The cells were pelleted at 300×g for 5 min at room temperature and resuspended in 5 ml PBS containing 2% FBS and 2 mM EDTA. The cells were counted and checked for viability using Trypan Blue.
RNA and genomic DNA were isolated from the CLL patient samples using the AllPrep DNA/RNA/miRNA Universal kit (Qiagen) according to the manufacturer’s instructions. Briefly, cell lysates were homogenized with a 21-G needle and syringe together with lysis buffer and 1 M DTT. The homogenized lysates were then transferred into an AllPrep DNA mini spin column for genomic DNA extraction. The genomic DNA was eluted in water and used for the IGHV mutation test. The flow-through from the AllPrep DNA mini spin column was transferred to an RNeasy Mini spin column with on-column digestion for RNA extraction. The RNA was eluted in water and sent for RNA-seq.
The IGHV mutation test was performed following the method of Agathangelidis et al. Briefly, IGHV-IGHD-IGHJ gene rearrangements were amplified with 5′ IGHV leader primers and 3′ IGHJ primers (primer sequences are provided in Additional file 1: Table S9) using genomic DNA (gDNA) from CLL patient samples. The PCR amplification was performed with the PCR core kit (Qiagen). Final PCR products were visualized by agarose gel electrophoresis and purified with a PCR purification kit (Qiagen). Purified PCR products were confirmed by Sanger sequencing using 3′ IGHJ primers. The Sanger sequencing results were analyzed with the IMGT/V-QUEST tools to obtain the IGHV identity scores. If the identity score was larger than 98%, the CLL sample was considered an unmutated sample, while if the score was lower than 98%, the CLL sample was considered a mutated sample.
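The classification rule above can be written as a small function. This sketch assumes the identity score is given as a percentage; because the rule as stated does not cover a score of exactly 98%, the function returns "undetermined" in that case, which is an assumption made here. The function name is hypothetical.

```python
def classify_ighv(identity_percent, threshold=98.0):
    """Classify a CLL sample from its IGHV germline identity score (in percent)."""
    if not 0.0 <= identity_percent <= 100.0:
        raise ValueError("identity score must be between 0 and 100")
    if identity_percent > threshold:
        return "unmutated"  # close to germline: uCLL
    if identity_percent < threshold:
        return "mutated"    # diverged from germline: mCLL
    # A score exactly at the threshold is not covered by the stated rule (assumption).
    return "undetermined"


# Hypothetical identity scores, for illustration only.
example_calls = [classify_ighv(score) for score in (99.5, 95.0, 98.0)]
```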
In situ Hi-C
Hi-C libraries were prepared using the Arima Genomics kit (Arima Genomics, San Diego, CA) in conjunction with the Swift Biosciences Accel-NGS 2S Plus DNA Library Kit (Cat # 21024) and Swift Biosciences Indexing Kit (Cat # 26148) following the manufacturer’s recommendations. In brief, 1 × 10^6 cells were fixed with formaldehyde in the nucleus. Fixed cells were permeabilized using a lysis buffer and then digested with a restriction enzyme cocktail supplied in the Arima Hi-C kit. The resulting overhangs were filled in with biotinylated nucleotides followed by ligation. After ligation, crosslinks were reversed, and the DNA was purified from protein. Purified DNA was treated to remove biotin that was not internal to ligated fragments. Hi-C material was then sonicated using a Covaris Focused-Ultrasonicator M220 instrument to achieve 300–500 bp fragment sizes. The sonicated DNA was double-size selected using Ampure XP beads, and the sequencing libraries were generated using low input Swift Biosciences Accel-NGS 2S Plus DNA Library Kit (Cat # 21024) and Swift Biosciences Indexing Kit (Cat # 26148). The Hi-C libraries were loaded on an Illumina flow cell for paired-end 150-nucleotide read length sequencing on the Illumina HiSeq 4000 following the manufacturer’s protocols.
MCF-7, a breast cancer cell line, was cultured in DMEM/F12 (Gibco) supplemented with 10% FBS and 1% penicillin-streptomycin and maintained in a humidified incubator at 37 °C with 5% CO2. Before 4C-seq assays, MCF-7 cells were grown in hormone-free media: they were washed with PBS twice to remove any residual FBS or growth factors and incubated in phenol red-free medium (Invitrogen/Gibco) supplemented with 10% charcoal-dextran stripped FBS (Hyclone) and 1% penicillin-streptomycin for a minimum of 72 h. Hormone-depleted MCF-7 cells were then treated with estrogen (Sigma) to a final concentration of 100 nM for 45 min before the 4C-seq assay. The control cells were treated with an equal volume and concentration of vehicle, ethanol (Sigma), for 45 min.
Circular chromosome conformation capture (4C)
4C-seq assays were performed according to Splinter et al. with slight modifications. Briefly, 4 × 10^7 cells were crosslinked with 1% formaldehyde. The nuclei pellets were isolated by cell lysis with cold lysis buffer (10 mM Tris-HCl, 10 mM NaCl, 5 mM EDTA, 0.5% NP 40) supplemented with protease inhibitors (Roche). First-step digestion was performed overnight at 37 °C with HindIII enzyme (NEB). Digestion efficiency was measured by RT-qPCR with HindIII site-specific primers. After confirmation of good digestion efficiency, DNA was ligated overnight at 16 °C by T4 DNA ligase (Thermo Scientific) and de-crosslinked. Following de-crosslinking, DNA was extracted by phenol-chloroform to give the 3C library. The DNA was then processed for second digestion with DpnII enzyme (NEB) overnight at 37 °C. After final ligation, 4C template DNA was obtained, and the concentration was determined using Qubit assays (Thermo Scientific). The 4C template DNA was then amplified using specific primers with Illumina Nextera adapters and sent for sequencing on the MiSeq system. All the 4C genome coordinates are listed in Additional file 1: Table S9.
Total RNA was extracted from the CLL samples using the All Prep DNA/RNA kit (Qiagen). The RNA was quantified using the Qubit BR RNA Assay kit. RNA-seq libraries (strand specific and ribo zero) were constructed using Illumina Total RNA Prep kit (Illumina, San Diego, CA, USA) and sequenced 150 bases paired-end on the Illumina HiSeq 4000 following the manufacturer’s instructions.
The ATAC-seq library was prepared as described previously. Briefly, 50,000 cells were lysed for nuclei isolation using ATAC-Resuspension Buffer containing 0.1% NP40, 0.1% Tween-20, and 0.01% Digitonin. The transposition reaction was performed for 30 min at 37 °C using Nextera DNA library preparation kit (NEB). Transposed fragments were amplified by eight PCR cycles for library preparation. Primer dimers and long DNA fragments were removed by an AMPure XP beads purification step. DNA concentration was measured by Qubit fluorometric assay and library quality was determined by Bioanalyzer. The library was sequenced on an Illumina NextSeq 500 in 76 bp paired-end configuration.
Availability of data and materials
All relevant data supporting the key findings of this study are available within the article and its Additional files.
The RNA-Seq, ATAC-Seq, and Hi-C datasets generated during the current study are available in GEO under accession number GSE163896. The 4C data have been deposited in GEO under accession number GSE135052 and are publicly available.
The source codes of ChINN are freely available under Apache License 2.0 at https://github.com/mjflab/chinn . The code used for this paper has also been deposited at Zenodo with DOI 10.5281/zenodo.5139249 .
Babu D, Fullwood MJ. 3D genome organization in health and disease: emerging opportunities in cancer translational medicine. Nucleus. 2015;6:382–93.
Schottenfeld D: Gastrointestinal cancer: epidemiology. In Gastrointestinal oncology: principles and practice. Edited by Kelsen DP, Daly JM, Levin B, Kern SE, Tepper JE. Philadelphia: Lippincott Williams and Wilkins; 2002.
Akıncılar SC, Ekta K, Boon PLS, Unal B, Fullwood MJ, Tergaonkar V. Long-range chromatin interactions drive mutant TERT promoter activation. Cancer Disc. 2016;6(11):1276–91.
Li G, Ruan X, Auerbach RK, Sandhu KS, Zheng M, Wang P, et al. Extensive promoter-centered chromatin interactions provide a topological basis for transcription regulation. Cell. 2012;148:84–98.
Jin F, Li Y, Dixon JR, Selvaraj S, Ye Z, Lee AY, et al. A high-resolution map of the three-dimensional chromatin interactome in human cells. Nature. 2013;503:290–4.
Ma W, Ay F, Lee C, Gulsoy G, Deng X, Cook S, et al. Fine-scale chromatin interaction maps reveal the cis-regulatory landscape of human lincRNA genes. Nat Methods. 2015;12:71–8.
Dowen Jill M, Fan Zi P, Hnisz D, Ren G, Abraham Brian J, Zhang Lyndon N, et al. Control of cell identity genes occurs in insulated neighborhoods in mammalian chromosomes. Cell. 2014;159:374–87.
Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 2009;326:289–93.
Fullwood MJ, Liu MH, Pan YF, Liu J, Xu H, Mohamed YB, et al. An oestrogen-receptor-alpha-bound human chromatin interactome. Nature. 2009;462:58–64.
Dixon JR, Selvaraj S, Yue F, Kim A, Li Y, Shen Y, et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature. 2012;485:376–80.
Lupianez DG, Kraft K, Heinrich V, Krawitz P, Brancati F, Klopocki E, et al. Disruptions of topological chromatin domains cause pathogenic rewiring of gene-enhancer interactions. Cell. 2015;161:1012–25.
Guo Y, Xu Q, Canzio D, Shou J, Li J, Gorkin DU, et al. CRISPR inversion of CTCF sites alters genome topology and enhancer/promoter function. Cell. 2015;162:900–10.
Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods. 2015;12:931–4.
Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol. 2015;33:831–8.
Cao Q, Anyansi C, Hu X, Xu L, Xiong L, Tang W, et al. Reconstruction of enhancer-target networks in 935 samples of human primary cells, tissues and cell lines. Nat Genet. 2017;49:1428–36.
He B, Chen C, Teng L, Tan K. Global view of enhancer-promoter interactome in human cells. Proc Natl Acad Sci U S A. 2014;111:E2191–9.
Roy S, Siahpirani AF, Chasman D, Knaack S, Ay F, Stewart R, et al. A predictive modeling approach for cell line-specific long-range regulatory interactions. Nucleic Acids Res. 2015;43:8694–712.
Singh S, Yang Y, Poczos B, Ma J. Predicting enhancer-promoter interaction from genomic sequence with deep neural networks. bioRxiv. 2016:85241.
Whalen S, Truty RM, Pollard KS. Enhancer-promoter interactions are encoded by complex genomic signatures on looping chromatin. Nat Genet. 2016;48:488–96.
Yang Y, Zhang R, Singh S, Ma J. Exploiting sequence-based features for predicting enhancer–promoter interactions. Bioinformatics. 2017;33:i252–60.
Zhu Y, Chen Z, Zhang K, Wang M, Medovoy D, Whitaker JW, et al. Constructing 3D interaction maps from 1D epigenomes. Nat Commun. 2016;7:10812.
Xi W, Beer MA. Local epigenomic state cannot discriminate interacting and non-interacting enhancer-promoter pairs with high accuracy. PLoS Comput Biol. 2018;14:e1006625.
Cao F, Fullwood MJ. Inflated performance measures in enhancer–promoter interaction-prediction methods. Nat Genet. 2019.
Fudenberg G, Kelley DR, Pollard KS. Predicting 3D genome folding from DNA sequence with Akita. Nat Methods. 2020;17:1111–7.
Li W, Wong WH, Jiang R. DeepTACT: predicting 3D chromatin contacts via bootstrapping deep learning. Nucleic Acids Res. 2019;47:e60.
Jing F, Zhang S-W, Zhang S. Prediction of enhancer–promoter interactions using the cross-cell type information and domain adversarial neural network. BMC Bioinformatics. 2020;21:1–16.
Schwessinger R, Gosden M, Downes D, Brown RC, Oudelaar AM, Telenius J, et al. DeepC: predicting 3D genome folding using megabase-scale transfer learning. Nat Methods. 2020;17:1118–24.
Rendeiro AF, Schmidl C, Strefford JC, Walewska R, Davis Z, Farlik M, et al. Chromatin accessibility maps of chronic lymphocytic leukaemia identify subtype-specific epigenome signatures and transcription regulatory networks. Nat Commun. 2016;7:11938.
Pal K, Forcato M, Ferrari F. Hi-C analysis: from data generation to integration. Biophys Rev. 2019;11:67–78.
Eagen KP. Principles of chromosome architecture revealed by Hi-C. Trends Biochem Sci. 2018;43:469–78.
Li G, Fullwood MJ, Xu H, Mulawadi FH, Velkov S, Vega V, et al. ChIA-PET tool for comprehensive chromatin interaction analysis with paired-end tag sequencing. Genome Biol. 2010;11:1–13.
Li G, Sun T, Chang H, Cai L, Hong P, Zhou Q. Chromatin interaction analysis with updated ChIA-PET Tool (V3). Genes. 2019;10:554.
Consortium EP. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74.
Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco: ACM; 2016. p. 785–94.
Hanssen LL, Kassouf MT, Oudelaar AM, Biggs D, Preece C, Downes DJ, et al. Tissue-specific CTCF–cohesin-mediated chromatin architecture delimits enhancer interactions and function in vivo. Nat Cell Biol. 2017;19:952–61.
Hou C, Dale R, Dean A. Cell type specificity of chromatin organization mediated by CTCF and cohesin. Proc Natl Acad Sci. 2010;107:3651–6.
Cuddapah S, Jothi R, Schones DE, Roh T-Y, Cui K, Zhao K. Global analysis of the insulator binding protein CTCF in chromatin barrier regions reveals demarcation of active and repressive domains. Genome Res. 2009;19:24–32.
Wang H, Maurano MT, Qu H, Varley KE, Gertz J, Pauli F, et al. Widespread plasticity in CTCF occupancy linked to DNA methylation. Genome Res. 2012;22:1680–8.
Gonzalez D, Luyten A, Bartholdy B, Zhou Q, Kardosova M, Ebralidze A, et al. ZNF143 protein is an important regulator of the myeloid transcription factor C/EBPalpha. J Biol Chem. 2017;292:18924–36.
Cimino-Mathews A, Subhawong AP, Illei PB, Sharma R, Halushka MK, Vang R, et al. GATA3 expression in breast carcinoma: utility in triple-negative, sarcomatoid, and metastatic carcinomas. Hum Pathol. 2013;44:1341–9.
Nichols MH, Corces VG. A CTCF code for 3D genome architecture. Cell. 2015;162:703–5.
Tang Z, Luo OJ, Li X, Zheng M, Zhu JJ, Szalaj P, et al. CTCF-mediated human 3D genome architecture reveals chromatin topology for transcription. Cell. 2015;163:1611–27.
Rao SS, Huntley MH, Durand NC, Stamenova EK, Bochkov ID, Robinson JT, et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. 2014;159:1665–80.
Rendeiro AF, Schmidl C, Strefford JC, Walewska R, Davis Z, Farlik M, et al. Chromatin accessibility maps of chronic lymphocytic leukaemia identify subtype-specific epigenome signatures and transcription regulatory networks. Nat Commun. 2016;7:1–12.
Kaderi MA, Kanduri M, Buhl AM, Sevov M, Cahill N, Gunnarsson R, et al. LPL is the strongest prognostic factor in a comparative analysis of RNA-based markers in early chronic lymphocytic leukemia. Haematologica. 2011;96:1153–60.
Morabito F, Cutrona G, Mosca L, D'Anca M, Matis S, Gentile M, et al. Surrogate molecular markers for IGHV mutational status in chronic lymphocytic leukemia for predicting time to first treatment. Leuk Res. 2015;39:840–5.
Rozovski U, Grgurevic S, Bueso-Ramos C, Harris DM, Li P, Liu Z, et al. Aberrant LPL expression, driven by STAT3, mediates free fatty acid metabolism in CLL cells. Mol Cancer Res. 2015;13:944–53.
Crespo M, Bosch F, Villamor N, Bellosillo B, Colomer D, Rozman M, et al. ZAP-70 expression as a surrogate for immunoglobulin-variable-region mutations in chronic lymphocytic leukemia. N Engl J Med. 2003;348:1764–75.
Takaku M, Grimm SA, Roberts JD, Chrysovergis K, Bennett BD, Myers P, et al. GATA3 zinc finger 2 mutations reprogram the breast cancer transcriptional network. Nat Commun. 2018;9:1059.
Yin Q, Wu M, Liu Q, Lv H, Jiang R. DeepHistone: a deep learning approach to predicting histone modifications. BMC Genomics. 2019;20:11–23.
Cao F, Fang Y, Tan HK, Goh Y, Choy JYH, Koh BTH, et al. Super-enhancers and broad H3K4me3 domains form complex gene regulatory circuits involving chromatin interactions. Sci Rep. 2017;7:1–14.
Shrikumar A, Greenside P, Kundaje A. Reverse-complement parameter sharing improves deep learning models for genomics. bioRxiv. 2017:103663.
Xu B, Wang N, Chen T, Li M. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853. 2015.
Kulakovskiy IV, Vorontsov IE, Yevshin IS, Soboleva AV, Kasianov AS, Ashoor H, et al. HOCOMOCO: expansion and enhancement of the collection of transcription factor binding sites models. Nucleic Acids Res. 2016;44:D116–25.
Mathelier A, Zhao X, Zhang AW, Parcy F, Worsley-Hunt R, Arenillas DJ, et al. JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles. Nucleic Acids Res. 2014;42:D142–7.
Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS. Quantifying similarity between motifs. Genome Biol. 2007;8:1–9.
Grant CE, Bailey TL, Noble WS. FIMO: scanning for occurrences of a given motif. Bioinformatics. 2011;27:1017–8.
Lassmann T. TagDust2: a generic method to extract reads from sequencing data. BMC Bioinformatics. 2015;16:1–8.
Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9:357–9.
Thongjuea S, Stadhouders R, Grosveld FG, Soler E, Lenhard B. r3Cseq: an R/Bioconductor package for the discovery of long-range genomic interactions from chromosome conformation capture and next-generation sequencing data. Nucleic Acids Res. 2013;41:e132.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43:e47.
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:1–21.
Bray NL, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol. 2016;34:525–7.
Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997. 2013.
Picard Toolkit. Broad Institute, GitHub repository. http://broadinstitute.github.io/picard. 2019.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:2078–9.
Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9:1–9.
Durand NC, Shamim MS, Machol I, Rao SS, Huntley MH, Lander ES, et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst. 2016;3:95–8.
Agathangelidis A, Sutton LA, Hadzidimitriou A, Tresoldi C, Langerak AW, Belessi C, et al. Immunoglobulin gene sequence analysis in chronic lymphocytic leukemia: from patient material to sequence interpretation. J Vis Exp. 2018.
Brochet X, Lefranc MP, Giudicelli V. IMGT/V-QUEST: the highly customized and integrated system for IG and TR standardized V-J and V-D-J sequence analysis. Nucleic Acids Res. 2008;36:W503–8.
Splinter E, de Wit E, van de Werken HJ, Klous P, de Laat W. Determining long-range chromatin interactions for selected genomic sites using 4C-seq technology: from fixation to computation. Methods. 2012;58:221–30.
Corces MR, Trevino AE, Hamilton EG, Greenside PG, Sinnott-Armstrong NA, Vesuna S, et al. An improved ATAC-seq protocol reduces background and enables interrogation of frozen tissues. Nat Methods. 2017;14:959–62.
Cao F, Zhang Y, Cai Y, Animesh A, Zhang Y, Akincilar SC, Loh YP, Li X, Chng WJ, Tergaonkar V, et al. Chromatin interaction neural network (ChINN): a machine learning-based method for predicting chromatin interactions from DNA sequences. GSE163896. Gene Expression Omnibus. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE163896. 2020.
Cao F, Zhang Y, Cai Y, Animesh A, Zhang Y, Akincilar SC, Loh YP, Li X, Chng WJ, Tergaonkar V, et al. Chromatin interaction neural network (ChINN): a machine learning-based method for predicting chromatin interactions from DNA sequences. GSE135052. Gene Expression Omnibus. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE135052. 2020.
Cao F, Zhang Y, Cai Y, Li X, Fullwood MJ. Chromatin interaction neural network. GitHub. https://github.com/mjflab/chinn. 2021.
Cao F, Zhang Y, Cai Y, Fullwood MJ. Chromatin interaction neural network (Version 1.1). Zenodo. http://doi.org/10.5281/zenodo.5139249. 2021.
We thank all members of the Fullwood Lab for helpful comments and the NUHS Leukemia Cell Bank for providing chronic lymphocytic leukemia samples.
The review history is available as Additional file 2.
Peer review information
Wenjing She was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
This research is supported by the National Research Foundation (NRF) Singapore through an NRF Fellowship awarded to M.J.F. (NRF-NRFF2012-054) and by NTU start-up funds awarded to M.J.F. It is also supported by the RNA Biology Center at the Cancer Science Institute of Singapore, NUS, as part of funding under the Singapore Ministry of Education Academic Research Fund Tier 3 awarded to Daniel Tenen as lead PI with M.J.F. as co-investigator (MOE2014-T3-1-006); by a National Research Foundation Competitive Research Programme grant awarded to V.T. as lead PI and M.J.F. as co-PI (NRF-CRP17-2017-02); by the National Research Foundation Singapore and the Singapore Ministry of Education under its Research Centres of Excellence initiative; and by a Ministry of Education Tier II grant awarded to M.J.F. (T2EP30120-0020).
Ethics approval and consent to participate
All samples were collected by the NUHS Leukemia Bank with patient consent. Ethics approval for the study was given by the National University of Singapore Institutional Review Board (IRB approval number: LH-20-022E).
The authors confirm that experimental methods comply with the Helsinki Declaration.
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Cao, F., Zhang, Y., Cai, Y. et al. Chromatin interaction neural network (ChINN): a machine learning-based method for predicting chromatin interactions from DNA sequences. Genome Biol 22, 226 (2021). https://doi.org/10.1186/s13059-021-02453-5
- Machine learning
- 3D genome organization
- Chromatin interactions
- DNA sequence