Chromatin interaction neural network (ChINN): a machine learning-based method for predicting chromatin interactions from DNA sequences

Chromatin interactions play important roles in regulating gene expression. However, the availability of genome-wide chromatin interaction data is limited. We develop a computational method, chromatin interaction neural network (ChINN), to predict chromatin interactions between open chromatin regions using only DNA sequences. ChINN predicts CTCF- and RNA polymerase II-associated and Hi-C chromatin interactions. ChINN shows good across-sample performances and captures various sequence features for chromatin interaction prediction. We apply ChINN to 6 chronic lymphocytic leukemia (CLL) patient samples and a published cohort of 84 CLL open chromatin samples. Our results demonstrate extensive heterogeneity in chromatin interactions among CLL patient samples. Supplementary Information: The online version contains supplementary material available at 10.1186/s13059-021-02453-5.

Text S1: An explanation of the performance of the distance feature

The reason why genomic features combined with the distance feature achieve significantly better performance than genomic features alone can be explained by the gradient boosted trees (GB) model that we used. The core of GB is the decision tree. In a decision tree, features are selected as nodes to split the dataset. When using the distance feature alone, it is impossible to split the dataset into positive and negative samples, because there is only one feature to split on and the positive and negative distance distributions are similar. But when the distance feature is used together with other features, the original dataset may already have been split into small subsets by other features before the distance feature is used. In those small subsets, split according to certain rules, the distance feature may help split them into even smaller subsets. Besides, adding one more feature can also change the pruning of the tree, which may lead to different predictions compared to a model without that feature.

Generation of positive and negative chromatin interactions
The GM12878 CTCF, GM12878 RNA Pol II, and HeLa-S3 CTCF ChIA-PET datasets were downloaded from the supplementary materials of Tang et al. [1]. Alignment files of the K562 RNA Pol II and MCF-7 RNA Pol II ChIA-PET datasets were obtained from Li et al. [2] and processed as described in Li et al. [2] to call chromatin interactions.
Chromatin interactions that overlap with ENCODE blacklisted regions [3] at either anchor were discarded. Anchors of remaining chromatin interactions in each dataset were merged by allowing 500 bp gap [1]. The chromatin interactions in each dataset were clustered again using the merged anchors. Clustered chromatin interactions were filtered against DNase-seq data of the same cell line by requiring both anchors to overlap with DNase I hypersensitivity sites. The remaining clustered chromatin interactions were used as positive samples for each dataset.
For each dataset, corresponding negative samples were generated from four sources: Source 1. Pairs of clustered chromatin interaction anchors that were not directly or indirectly connected. We built networks of merged anchors in which the merged anchors were nodes and the chromatin interactions were edges. Two nodes were considered directly connected if there existed an edge between them, and indirectly connected if there existed a path of nodes and edges between them in the network.
Source 2. Random pairs of peaks from ChIP-seq data of the transcription factor from which the ChIA-PET library was built. The ChIP-seq data and ChIA-PET data are from the same cell line: peak pairs from CTCF ChIP-seq data were used for CTCF ChIA-PET datasets and peak pairs from RNA Pol II ChIP-seq data were used for RNA Pol II ChIA-PET datasets. The peak pairs were generated as follows: (a) Transcription factor peaks that overlap with ENCODE blacklisted regions were removed.
(b) From the remaining peaks, peaks that overlap with DNase I hypersensitivity regions were extracted.
(c) For each remaining peak, sample a size from a normal distribution. If the size of the peak is smaller than the sampled size, extend the peak to match the sampled size. The extended peaks were then merged. The mean and variance of the normal distribution were adjusted empirically to make the distribution of the sizes of merged peaks match the distribution of anchor sizes of the positive samples.
(d) All pairs of merged peaks that were 5 kb to 2 Mb apart and were not directly or indirectly connected were kept.
Source 3. Random pairs of peaks from DNase-I data of the cell line from which the ChIA-PET library was built. The peak pairs were generated as follows: (a) DNase-I peaks that overlap with ENCODE blacklisted regions were removed.
(b) For each remaining peak, sample a size from a normal distribution. If the size of the peak is smaller than the sampled size, extend the peak to match the sampled size. The extended peaks are then merged. The mean and variance of the normal distribution were adjusted empirically to make the distribution of the sizes of merged peaks match the distribution of anchor sizes of the positive samples.
(c) All pairs of merged peaks that were 5 kb to 2 Mb apart and were not directly or indirectly connected were kept. In addition, this set of negative samples was filtered against the negative samples generated from transcription factor peaks to reduce redundancy among negative samples.
Source 4. Pairs of clustered chromatin interaction anchors that were indirectly connected.
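The size-matching step shared by Sources 2 and 3 (sample a target size from a normal distribution and extend peaks that fall short of it) can be sketched as follows. This is a minimal illustration: the function name, the symmetric extension, and the seeded random generator are our own assumptions rather than details from the text.

```python
import random

def extend_peak(start, end, mean, sd, rng=random.Random(0)):
    """Extend a peak symmetrically to a size drawn from N(mean, sd).

    The sampled size is only applied when the peak is shorter than it;
    larger peaks are left unchanged, matching the rule described above.
    """
    target = max(0, int(rng.gauss(mean, sd)))
    size = end - start
    if size >= target:
        return start, end
    pad = target - size
    left = pad // 2
    return start - left, end + (pad - left)
```

In the pipeline described above, `mean` and `sd` would be tuned empirically so that the merged extended peaks match the anchor-size distribution of the positive samples.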
We generated two types of negative datasets: distance-matched negative datasets and extended negative datasets.
For each positive dataset, a distance-matched negative dataset was generated by sampling negative samples from the first three sources such that the positive-to-negative ratio was about 1:5 for every chromosome and the distance distributions were matched (Table S2). We split the 5 kb-2 Mb distance range into 50 bins on the logarithmic scale and counted the number of positive samples in each bin for each chromosome. Negative samples were then sampled such that there were about 5 negative samples for every positive sample in each bin of every chromosome. If the number of negative samples in a bin of a chromosome was less than 5 times the number of positive samples, additional negative samples could be sampled from the same bin of another chromosome to make up the difference. Different types of negative samples were prioritized differently: anchor pairs (Source 1) and transcription factor peak pairs (Source 2) were sampled first, and if these two sources did not provide enough negative samples, additional negative samples were drawn from DNase-I peak pairs (Source 3).
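The logarithmic distance binning used for distance matching can be sketched as below; the function name and the clamping of out-of-range distances to the edge bins are illustrative assumptions.

```python
import math

def log_bin_index(distance, lo=5_000, hi=2_000_000, n_bins=50):
    """Map an anchor-pair distance to one of n_bins logarithmically
    spaced bins spanning [lo, hi), as used for distance matching."""
    frac = (math.log(distance) - math.log(lo)) / (math.log(hi) - math.log(lo))
    return min(n_bins - 1, max(0, int(frac * n_bins)))
```

Counting positives per (chromosome, bin) pair and drawing roughly five negatives per positive from the same bin then reproduces the matching scheme described above.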
For each positive dataset, an extended negative dataset was generated by including pairs from Source 4, the corresponding distance-matched negative dataset, and the remaining negative samples from Source 1 and Source 2. This dataset is called the "extended dataset" in the main text.
Samples on chromosomes 5 and 14 were used for validation. Samples on chromosomes 4, 7, 8, and 11 were used for testing. The remaining samples were used for training. The validation sets were used to guide the selection of the trained models. The test sets were used to obtain the final performance metrics and were strictly not used in the training or selection of models.
We note that here, "bin" does not refer to a segment of a chromosome; it is a segment of the distance distribution histogram.

Overview of the functional genomic models
The functional genomic models were based on gradient tree boosting classifiers with features generated from functional genomic data such as ChIP-seq and DNase-seq data. We and others have shown that many of the "state-of-the-art" methods based on functional genomic data had highly exaggerated performances, so whether chromatin interactions could be accurately predicted from functional genomic data became unclear. This undermined the premise behind developing predictive models based on DNA sequences, namely that chromatin interactions can be predicted from functional genomic data and that functional genomic data can be predicted from DNA sequences. The purpose of developing the functional genomic models here was to confirm that chromatin interactions can be predicted from functional genomic data. For this purpose, we trained and evaluated these models only on the distance-matched datasets.

Generating features for functional genomic models
ChIP-seq peaks of histone marks and transcription factors were downloaded from ENCODE [3]. Histone marks and transcription factors that were common to K562, GM12878, and HelaS3 cells, together with DNase-seq peaks, were used (Table S1). For each sample, the feature values were calculated for both anchors independently. For an anchor a, let F_a be its feature vector. The anchor's feature value for a transcription factor/histone modification/DNase feature f is calculated as

F_{a,f} = Σ_i (o_{a,f,i} / s_a) · v_{f,i},

where s_a is the size of the anchor, o_{a,f,i} is the overlap size between anchor a and peak i of factor f, and v_{f,i} is the average signal value of peak i of factor f. In short, the feature value of an anchor is the weighted sum of the signals of the overlapping peaks of a factor, where the weights are the proportions of the anchor covered by the overlaps.
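The per-anchor feature computation described above (signal values weighted by the fraction of the anchor each peak covers) can be sketched in a few lines; the function name and the `(overlap_size, avg_signal)` tuple layout are illustrative.

```python
def anchor_feature(anchor_size, overlaps):
    """Feature value of one anchor for one factor: the sum of peak
    signal values, each weighted by the fraction of the anchor
    covered by that peak.  `overlaps` is a list of
    (overlap_size, avg_signal) tuples for peaks hitting the anchor."""
    return sum(o / anchor_size * v for o, v in overlaps)
```

For example, an anchor of 1000 bp overlapped by a 500-bp stretch of a peak with signal 2.0 and a 250-bp stretch of a peak with signal 4.0 gets feature value 0.5·2.0 + 0.25·4.0 = 2.0.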
The distance feature was calculated as the distance between the centers of two anchors.
We also evaluated models based on peak counts instead of signal values. When peak counts were used as features instead of signal values, we counted the peaks of a factor that overlap with an anchor and used the counts as feature values. In this case, a peak was only considered as overlapping with the anchor if the overlap size was at least 100 bp or at least 50% of the size of either the peak or the anchor.

Training and evaluation of functional genomic models
We built functional genomic models for the GM12878 CTCF, GM12878 RNA Pol II, HelaS3 CTCF, and K562 RNA Pol II datasets individually. The functional genomic models were constructed using gradient boosted trees from the XGBoost library [5]. The parameters used were: max_depth = 6, eta = 0.1, eval_metric = logloss. The models were trained for up to 1000 iterations with early stopping when the performance on the validation set did not improve for 40 consecutive iterations. The iteration that produced the best validation performance was used as the final trained model for each dataset. The performance metrics were then obtained by applying the final trained models to the test sets.

Overview of the sequence models
The development of the sequence models was divided into three stages. In the first stage, the distance-matched datasets were used to train models consisting of a convolutional neural network (the feature extractor) with fully-connected layers as the classifier, as shown in Figure 2a. In the second and third stages, the feature extractors trained in the first stage were frozen and gradient tree boosting classifiers were used as the classifiers. In the second stage, the gradient tree boosting classifiers were trained using the extended datasets. In the third stage, the gradient tree boosting classifiers were trained using all potential pairs of anchors generated from open chromatin data and annotated by existing ChIA-PET data. Thus, the final result was a program that takes a list of open chromatin regions and produces predictions of chromatin interactions between those open chromatin regions.
The feature extractors took DNA sequences of both anchors of a potential interacting pair as input. The classifier then took the features generated by the feature extractor and optionally the distance between anchors as input and produced a probability score of interaction.

Preparing input for sequence models
We used the human genome assembly hg19 as our reference. The sequences of all anchors were extracted and converted to 4 × L one-hot coded matrices, where L is the length of the sequence. The four rows of the matrices represent the occurrences of nucleotides A, G, C, and T, respectively. Thus, nucleotide A was represented as [1, 0, 0, 0]^T, nucleotide G as [0, 1, 0, 0]^T, nucleotide C as [0, 0, 1, 0]^T, and nucleotide T as [0, 0, 0, 1]^T. If an N was encountered in the sequence, it was represented as [0.25, 0.25, 0.25, 0.25]^T. Each sequence was split into 1000-bp subregions with a 500-bp overlap between consecutive subregions.
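The one-hot encoding above, including the uniform column for N, can be sketched as follows; treating every non-AGCT character like N is our own simplifying assumption.

```python
import numpy as np

ROW = {"A": 0, "G": 1, "C": 2, "T": 3}  # row order as in the text

def one_hot(seq):
    """Encode a DNA sequence as a 4 x L matrix with rows A, G, C, T;
    an ambiguous base (N) becomes a uniform 0.25 column."""
    m = np.zeros((4, len(seq)))
    for j, base in enumerate(seq.upper()):
        if base in ROW:
            m[ROW[base], j] = 1.0
        else:
            m[:, j] = 0.25
    return m
```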
The distance feature was calculated as the distance between the centers of two anchors.

The sequence model design considerations
A typical model based on a convolutional neural network consists of two parts: the feature-extracting convolutional layers (the feature extractor) and the fully-connected layers (the classifier) that aggregate the extracted features and produce predictions or regressions. A convolutional layer is a stack of kernels, where each kernel is cross-correlated (convolved) with the input independently, producing measures of the local similarities between the input and the kernel. Non-linearities such as the rectified linear unit (ReLU) are applied to the outputs of convolutional layers to break the linearity. Convolutional layers are usually followed by max pooling layers, which both reduce the spatial dimension of the output and allow small translations of the features along the spatial dimensions. Max pooling layers take local maxima of the convolutional outputs of each kernel independently. The output of the last convolutional layer is usually flattened and fed into the fully-connected layers to make predictions or regressions.
In this study, we used a weighted-sum layer [6] after the last convolutional layer to obtain an aggregated score for each kernel in the last convolutional layer. The weighted-sum layer learns a weight distribution for each channel in the input independently and calculates the weighted sum of each channel along the spatial dimensions. Intuitively, the weighted-sum layer learns to weight the detected features at different spatial positions. For example, a weighted-sum layer could have high weights for features detected near the center of the input sequence while penalizing features detected in surrounding regions by giving them negative weights. The weighted-sum layer is followed by a non-linear activation function such as tanh or sigmoid. We used tanh for CTCF datasets and sigmoid for RNA Pol II datasets.
The sizes of chromatin interaction anchors vary substantially, ranging from kilobases to tens of kilobases, so the network needs to accommodate different anchor sizes. Each anchor was split into 1000-bp subregions with a 500-bp overlap between consecutive subregions. The sequences of the subregions were fed into the convolutional layers independently. Following the non-linear activation after the weighted-sum layer, the maximum of each kernel across all subregions of an anchor was taken as the feature value for that anchor. Consequently, each anchor, regardless of its size, is represented by the same number of features.
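The subregion splitting above can be sketched as follows. The exact handling of anchors whose length is not a multiple of the stride is not specified in the text, so the trailing flush window here is an assumption, as is the function name.

```python
def split_subregions(seq, window=1000, stride=500):
    """Split an anchor sequence into fixed-size subregions with
    overlap between consecutive subregions (window=1000, stride=500
    reproduces the scheme described above)."""
    if len(seq) <= window:
        return [seq]
    subs = [seq[i:i + window] for i in range(0, len(seq) - window + 1, stride)]
    # keep a final window flush with the end when the length is not a
    # multiple of the stride (an assumption; edge handling unspecified)
    if (len(seq) - window) % stride:
        subs.append(seq[-window:])
    return subs
```

Each subregion is then encoded and fed through the convolutional layers independently, and the per-kernel maxima across subregions give a fixed-length representation per anchor.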

Mathematical formulation of the components used
An anchor can be represented by a matrix X of shape N × L × C, where N is the number of subregions, L is the length of the subregions, and C is the number of channels (nucleotides in this application). For each kernel k in a convolutional layer, let S be the width of the kernel, W^k of size S × C be its weight matrix, and b_k its bias. Then

conv(X)_{n,l,k} = LeakyReLU( b_k + Σ_{s=1}^{S} Σ_{c=1}^{C} W^k_{s,c} X_{n,l+s−1,c} ),

where LeakyReLU(x) = x for x ≥ 0 and αx otherwise, and α is a small constant set to 1/5.5 [7].

For a pooling layer, let S be the size of the pool and T be the stride; then for input X of shape N × L × C,

pool(X)_{n,l,c} = max_{t=1,…,S} X_{n,(l−1)T+t,c}.

For a weighted-sum layer, let W^ws be the weight matrix of shape L × C; for input X of shape N × L × C, the output has shape N × C:

ws(X)_{n,c} = Σ_{l=1}^{L} W^ws_{l,c} X_{n,l,c}.

For RNA Pol II datasets, each value x in the output of the weighted-sum layer is transformed by sigmoid(x) = 1 / (1 + e^{−x}); for CTCF datasets, each value is transformed by tanh(x). The feature values of the anchor are then generated by taking the maximum of each kernel k across all subregions:

maxfeature(X)_k = max_{n=1,…,N} X_{n,k}.
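The weighted-sum layer and the per-kernel maximum over subregions can be expressed compactly in NumPy; this is a sketch of the forward computations only (no learned weights), with illustrative function names.

```python
import numpy as np

def weighted_sum(X, W):
    """Weighted-sum layer: X has shape (N, L, C), W has shape (L, C);
    each channel is reduced along the spatial axis with its own
    weight profile, giving an (N, C) output."""
    return np.einsum("nlc,lc->nc", X, W)

def max_feature(X):
    """Maximum of each channel across all N subregions, mapping every
    anchor to a fixed-length feature vector."""
    return X.max(axis=0)
```

For an anchor with N subregions and C kernels in the last convolutional layer, `max_feature(weighted_sum(X, W))` yields the C-dimensional anchor representation described above (before the tanh/sigmoid non-linearity).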

Training and evaluation of the sequence models on distance-matched datasets
For each dataset, a model was trained using samples from its training set. Two rounds of training were performed for each model; each round was run for 50 epochs and stopped early if the loss on the validation data did not improve for 10 consecutive epochs. The first round of training used larger learning rates and weighted positive samples by the negative-to-positive ratio in the training set, which was around 5. The second round of training used smaller learning rates. For CTCF models, the weights for positive samples were removed in the second round of training. The performances of the models on the validation datasets were calculated after each epoch, and the model at the epoch that produced the best performance was selected. The final performance metrics were then calculated using the test sets.

Examining sequence features
The sequence patterns detected by kernels in the first and third convolutional layers were examined. For each kernel in the first convolutional layer, as in DeepBind [8], we fed all test samples to the network and extracted, from each sample, a subsequence that passed the activation threshold of 0. If an anchor had multiple subsequences passing the activation threshold, the subsequence with the highest activation was retained. Sequences with no subsequence passing the activation threshold were discarded. The obtained subsequences for each kernel were then aligned to obtain a position-weight matrix for the kernel. The position-weight matrices of the kernels were compared against transcription factor motif databases (HOCOMOCO version 11 [9] and the JASPAR core 2014 vertebrates collection [10]) using TomTom [11] with an E-value threshold of 0.05.

For each kernel in the third convolutional layer, we generated an input matrix that maximized the output of the kernel. An input matrix was initialized randomly and fed into the network. The activation of the kernel under examination was obtained, and the negation of its output was used as the loss to calculate gradients on the input. The input matrix was then updated with stochastic gradient descent. The values of each column of the input matrix, each of which represents a base pair, were first clamped to the range [0, 1] and then normalized so that they summed to 1. For each kernel, 20,000 iterations were performed. The resulting input matrices were then used as position-weight matrices. As these position-weight matrices are 163 bp long and TomTom takes a long time to compare them to motif databases, we split them into smaller 30-bp matrices with a 20-bp stride. These smaller matrices were compared to the transcription factor motif databases (HOCOMOCO version 11 [9] and the JASPAR core 2014 vertebrates collection [10]) using TomTom [11] with an E-value threshold of 0.01.

Training gradient boosted trees using extended datasets
The samples in the extended datasets were converted to matrices as described above. The convolutional layers trained using the distance-matched datasets were frozen and were used to produce feature vectors for the samples in the extended datasets. The generated feature vectors were used to train gradient tree boosting classifiers together with the distances between anchors. The splitting of samples into training, validation, and test sets were the same as described above.
We used gradient boosted trees from the XGBoost library [5]. The max_depth parameter was set to 10, and the models were trained for up to 1000 iterations with early stopping if the loss on the validation dataset did not improve for 40 consecutive iterations. The reported performance metrics were based on the test sets.
The classifiers on the five datasets achieved within-sample auPRCs ranging from 0.4 to 0.6 (Supplementary Figure 3a). If distance was not used, the performance of the classifiers dropped significantly (Supplementary Figure 3b). However, distance alone could not predict chromatin interactions (Supplementary Figure 3c). These results suggest that the interplay between distance and sequence features is important for predicting chromatin interactions in general. The across-sample performances of the models on the extended datasets were comparable to the corresponding within-sample performances, demonstrating the generalizability of the predictive models to new cell types using only sequence features and distance (Supplementary Figure 3d).

CTCF-only models
We obtained all occurrences of the CTCF motif in the genome using FIMO [12]. For each strand of each anchor, the sum of the matching scores of strand-specific occurrences of the CTCF motif within the anchor was calculated. The values were then normalized by the size of the anchors. Thus, for each chromatin interaction, a feature vector of four values was generated, each value representing one strand of one anchor. Using these features together with distance, we trained another set of gradient boosted trees on the extended CTCF datasets.

Training gradient boosted trees using pairs of open chromatin regions in GM12878
Two parameters, merging distance and extension size, were considered in determining how to generate anchors from DNase I hypersensitivity sites. As shown in Figure 4a, the merging distance determines the gap allowed between neighboring peaks when merging them, and the extension size determines how much to extend both ends of the merged peaks. To decide the two parameters, we processed the DNase I peaks with different combinations of the two parameters to generate anchors. Anchor pairs were then generated from the list of anchors such that the distance between the centers of the two anchors was in the range of 5 kb to 2 Mb. Anchor pairs overlapping with ChIA-PET chromatin interactions were labelled as positive and the rest were labelled as negative. We then evaluated, for all combinations of the two parameters, the performance achieved on these anchor pairs by the models trained on the extended datasets. The combination that achieved the best area under the ROC curve was selected. We used auROC because different merging distances can lead to different numbers of anchor pairs and thus different positive-to-negative ratios; positive-to-negative ratios affect auPRC and F1 measures, as their baselines are sensitive to this ratio.
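The auROC-based selection over parameter combinations can be sketched as follows; the dictionary layout mapping each (merging_distance, extension_size) pair to its labelled scores is an illustrative assumption.

```python
from sklearn.metrics import roc_auc_score

def best_parameters(candidates):
    """Pick the (merging_distance, extension_size) pair whose labelled
    anchor pairs give the highest auROC under a fixed trained model.
    `candidates` maps each parameter pair to (y_true, y_score) lists."""
    return max(candidates, key=lambda p: roc_auc_score(*candidates[p]))
```

Because auROC is insensitive to the positive-to-negative ratio, the comparison stays fair even when different merging distances produce different numbers of anchor pairs, which is the rationale given above.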
The DNase-I hypersensitivity sites were processed using the selected merging distance and extension size to produce anchors. Pairs of anchors were generated and labelled as described above. The anchor pairs were passed through the feature extractor to generate sequence feature vectors. The sequence feature vectors and the distances between anchors were then used to train gradient boosted tree classifiers. The parameters were the same as those used for the extended datasets. The splitting of samples into training, validation, and test sets followed the same rule as described above.
The trained classifiers produced a probability for each anchor pair. We noted that using 0.5 as the cutoff to call positive pairs did not produce optimal F1 scores. Thus, based on the F1 scores on the validation sets, we selected the cutoff that produced the best F1 score for each dataset. We obtained the predicted probability scores on the validation sets; samples with scores greater than or equal to the cutoff value were predicted positive, and samples with scores below the cutoff were predicted negative. The F1 scores were calculated using the Python function sklearn.metrics.f1_score for different choices of cutoff value (Table S10). The selected probability cutoffs were 0.22 for CTCF and 0.2 for RNA Pol II chromatin interaction predictions, respectively.
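The cutoff search on the validation set can be sketched as a simple scan; the function name and the candidate grid of cutoffs are illustrative assumptions, since the text does not specify which cutoff values were tried.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_cutoff(y_true, y_score, cutoffs=np.arange(0.05, 0.95, 0.01)):
    """Return the probability cutoff with the highest validation F1:
    scores >= cutoff are called positive, matching the rule above."""
    return max(cutoffs,
               key=lambda c: f1_score(y_true, (y_score >= c).astype(int)))
```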
4C-seq library generation

MCF-7, a breast cancer cell line, was cultured in DMEM/F12 supplemented with 10% FBS and 1% penicillin-streptomycin and maintained in a humidified incubator at 37°C and 5% CO2. MCF-7 cells were grown in hormone-free media: they were washed with PBS and incubated in phenol red-free medium (Invitrogen/Gibco) supplemented with 10% charcoal-dextran-stripped FBS (Hyclone) and 1% penicillin-streptomycin for a minimum of 72 hours. Hormone-depleted MCF-7 cells were treated with oestrogen to a final concentration of 100 nM for 45 min before the 4C assay. The control cells were treated with an equal volume and concentration of vehicle, ethanol (ET, Merck), for 45 min (Supplementary Figure 5b).

For both MCF-7 and K562 cell lines, 4C-seq assays were performed as previously described in Splinter et al. [13] with slight modifications. Briefly, 4 × 10^7 cells were cross-linked with 1% formaldehyde and the nuclei pellets were isolated after cell lysis with cold lysis buffer supplemented with protease inhibitors. First-step digestion was performed overnight at 37°C with HindIII enzyme. Digestion efficiency was measured by RT-qPCR with HindIII site-specific primers. After phenol-chloroform extraction, DNA was ligated overnight at 16°C by T4 DNA ligase. Following de-crosslinking, DNA was processed for a second digestion with DpnII enzyme overnight at 37°C. After the final ligation, 4C template DNA concentration was determined using a fluorescence assay (PicoGreen, Invitrogen) and used for library preparation for MiSeq sequencing using specific primers with Illumina Nextera adapters.

4C-seq data analysis
Primer sequences at the 5' ends of reads were trimmed using TagDust2 [14]. Extracted reads were then mapped to the reference genome (hg19) using Bowtie2 (2.2.6) [15]. Unaligned reads were collected; the first 50 base pairs of these reads were extracted and realigned to the genome using Bowtie2 (2.2.6). The uniquely aligned reads (MAPQ ≥ 30) were combined. The R package r3Cseq [16] was used to call significant interactions using a non-overlapping window approach with 5-kb windows.

RT-qPCR experiments
Total RNA was isolated from the cells using the RNeasy Mini Kit (Qiagen) with on-column DNase digestion (Qiagen). 1 µg of total RNA was then reverse transcribed to cDNA using the SuperScript III first-strand synthesis system with oligo(dT) primers (Invitrogen). The expression levels of various genes were analysed by real-time PCR. Quantitative real-time PCR (qPCR) was performed on an Applied Biosystems QuantStudio 3 Real-Time PCR system using SYBR Green PCR Master Mix and appropriate primers. The transcript levels of genes were analysed by the 2^−ΔΔCt method.

Analysis of chronic lymphocytic leukemia samples
ATAC-seq peaks of chronic lymphocytic leukemia (CLL) samples were collected from Gene Expression Omnibus (GSE81274). Peaks from the CLL samples were pooled and merged with a merging distance of 3000 bp. The merged peaks were then extended by 1000 bp on both sides. All potential pairs of merged peaks that were on the same chromosome and had a separation (measured as the distance between the centers of the two peaks) in the range of 5 kb to 2 Mb were used. The pairs were then fed into the CTCF and RNA Pol II ChINN models with gradient boosted trees trained on genome-wide open chromatin pairs to obtain probabilities of interaction. The probability cutoffs were 0.22 for CTCF chromatin interactions and 0.2 for RNA Pol II chromatin interactions, as selected for the gradient boosted trees trained on genome-wide open chromatin pairs described above. For each CLL sample, its CTCF and RNA Pol II chromatin interactions were the predicted chromatin interactions whose anchors both overlap with peaks in the ATAC-seq library of the sample.
Random forest classifiers were trained on the predicted CTCF and RNA Pol II chromatin interactions, respectively, to classify the samples into IGHV-mutated (mCLL) and IGHV-unmutated (uCLL) samples. Each sample was represented by a binary feature vector over the predicted chromatin interactions, where each value indicates whether the predicted chromatin interaction is present in the sample. We used the scikit-learn [17] implementation of the random forest classifier with the parameter n_estimators set to 500. The performances were obtained using 4-fold cross-validation with samples from the same patient always in the same group. The top 1000 most important features of each trained random forest classifier were then used for hierarchical clustering. Label-shuffled performances were obtained by training random forest classifiers on label-shuffled samples with everything else unchanged.
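The patient-grouped 4-fold cross-validation can be sketched with scikit-learn's GroupKFold, which guarantees that samples sharing a group (here, a patient) never straddle the train/test split. The data shapes, labels, and two-samples-per-patient grouping below are synthetic stand-ins; n_estimators=500 follows the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(24, 50))   # binary interaction-presence vectors
y = rng.integers(0, 2, size=24)         # mCLL (1) vs uCLL (0), toy labels
groups = np.repeat(np.arange(12), 2)    # two samples per patient (assumed)

scores = []
for tr, te in GroupKFold(n_splits=4).split(X, y, groups):
    # no patient appears on both sides of the split
    assert not set(groups[tr]) & set(groups[te])
    clf = RandomForestClassifier(n_estimators=500, random_state=0)
    clf.fit(X[tr], y[tr])
    scores.append(clf.score(X[te], y[te]))
```

After fitting, `clf.feature_importances_` provides the importance ranking from which the top features for hierarchical clustering would be taken.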
Differential chromatin interactions were obtained by using Fisher's Exact test to evaluate whether a chromatin interaction was more associated with any of the two CLL subtypes. The p-value threshold used was 0.01. The same Fisher's Exact test was applied to anchor peaks of the predicted chromatin interactions to test whether they were also more associated with any subtype of CLL. Differential chromatin interactions with both anchor peaks having p-values smaller than 0.01 were categorized as 'Both', those with only one anchor peak having p-value smaller than 0.01 were categorized as 'One-side', and those having neither anchor peak with p-value smaller than 0.01 were categorized as 'Neither'.
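The subtype-association test and the anchor-based categorization above can be sketched as follows; the 2×2 table values are toy numbers, and the function name is illustrative.

```python
from scipy.stats import fisher_exact

def interaction_category(p_anchor1, p_anchor2, alpha=0.01):
    """Categorise a differential interaction by how many of its two
    anchor peaks are themselves subtype-associated (p < alpha)."""
    n = sum(p < alpha for p in (p_anchor1, p_anchor2))
    return {2: "Both", 1: "One-side", 0: "Neither"}[n]

# toy 2x2 contingency table: interaction present/absent in mCLL vs uCLL
odds, p = fisher_exact([[40, 10], [5, 45]])
```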
Two sets of CLL gene expression data were used: 1) a microarray dataset containing 76 mCLL samples and 58 uCLL samples from Gene Expression Omnibus (GSE69034); 2) a RNA-seq dataset containing 5 mCLL and 5 uCLL samples as an expression matrix downloaded from Gene Expression Omnibus (GSE81274). The RNA-seq dataset was from the same patient cohort as the ATAC-seq data. For the microarray dataset, limma [18] was used for differential gene expression analysis and genes with adjusted p-value smaller than 0.01 were selected. For the RNA-seq dataset, DESeq2 [19] was used for differential gene expression analysis and genes with adjusted p-value smaller than 0.05 were selected.
Each predicted chromatin interaction i and each open chromatin region o was assigned a number p^s_i or p^s_o indicating the percentage of samples of subtype s it appeared in. The connectivity c^s_t of a transcription start site t in subtype s was measured by averaging the percentages of samples in which its associated chromatin interactions appeared:

c^s_t = (1/|I|) Σ_{i∈I} p^s_i,

where I is the set of predicted chromatin interactions that overlapped with transcription start site t. The presence of a transcription start site in a subtype was measured by the open chromatin region closest to the transcription start site. We then examined the fold changes of connectivity and presence of each transcription start site between the uCLL and mCLL subtypes.
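The connectivity measure reduces to a simple mean over the per-interaction sample percentages; a minimal sketch, where the zero return for a TSS with no overlapping interactions is our own assumption:

```python
def connectivity(percentages):
    """Connectivity of a TSS in a subtype: the mean, over interactions
    overlapping the TSS, of the fraction of samples each interaction
    appears in.  Returns 0.0 when no interaction overlaps the TSS
    (an assumption; the text does not cover this case)."""
    return sum(percentages) / len(percentages) if percentages else 0.0
```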
The Hi-C, RNA-seq, and ATAC-seq data of the six new CLL samples were deposited in GSE163896. For RNA-seq, kallisto (0.46.0) [20] was used to quantify the abundance of transcripts. The reference transcriptome used was Ensembl release 75. kallisto reports transcripts per million (TPM) for the transcripts. The TPM values were not normalized across samples. For ATAC-seq, reads were aligned as paired-end reads to the hg19 genome assembly using BWA (0.7.17-r1188) [21]. The command used was bwa mem -M. Duplicates were marked and removed using the MarkDuplicates function of Picard (2.22.2) [22] from GATK. Paired-end reads with either end unmapped were filtered out using samtools (1.6) [23]. macs2 (2.2.6) [24] was used for peak calling with the -BAMPE option. For Hi-C, BWA (0.7.17-r1188) [21] was used to align the paired-end reads to the hg19 genome assembly. The command used was bwa mem -SP5M. The ENCODE Hi-C pipeline was used for filtering Hi-C alignments. Unmapped pairs and pairs with low mapping quality were removed. The pairs were deduplicated based on the mapping location of both ends. HiCCUPS from Juicer Tools (1.9.12) [25] was used for calling loops. Contact domains were detected using Arrowhead from Juicer Tools (1.9.12) [25].