Estimating DNA-DNA interaction frequency from Hi-C data at restriction-fragment resolution

Hi-C is a popular technique to map three-dimensional chromosome conformation by capturing the frequency of physical contacts between pairs of genomic regions in cell populations. Although the resolution of Hi-C data is in principle only limited by the size of restriction fragments (300 bp - 4 kb), stochastic noise caused by the limited sequencing coverage forces researchers to artificially reduce the resolution of Hi-C matrices by binning the genome into 5-100 kb regions, resulting in a loss of information and biological interpretability. Here, we present the Hi-C Interaction Frequency Inference (HIFI) algorithms, a family of computational approaches that takes advantage of dependencies between neighboring restriction fragments to estimate restriction-fragment resolution interaction frequency matrices from Hi-C data. HIFI is shown to be superior to existing fixed-binning and state-of-the-art approaches via cross-validation experiments on Hi-C data and comparisons to 5C data. It also greatly improves the delineation of enhancer-promoter contacts. Finally, the high resolution afforded by HIFI reveals a new role for active regulatory regions in structuring topologically associating domains (TADs) and subTADs. By operating upstream of many Hi-C data analysis tools (e.g., normalization tools, as well as loop, TAD, and compartment predictors), HIFI will be easily inserted into a number of Hi-C data analysis pipelines, enabling a variety of high-resolution genomic organization analyses. Availability github.com/BlanchetteLab/HIFI


Introduction
Cells are complex, dynamic environments that require constant regulation of their genes to ensure survival. The advent of chromosome conformation capture (3C) technologies (1), and recent advances in imaging techniques (2), have led to an improved understanding of genome organization and its role in gene regulation (3,4). Hi-C (5), a highthroughput derivative of 3C, provides an unparalleled view of three-dimensional (3D) genome organization by capturing all DNA-DNA contacts found within a cell population. Hi-C has revealed different levels of genome organization, including the topologically associating domains (TADs (6, 7) and subTADs (8,9)) and chromatin compartments (5). Yet, the potential for a more refined understanding of 3D genome organization remains largely untapped (10).
In a Hi-C experiment, cross-linked chromatin is digested into fragments using a restriction enzyme (RE). Restriction fragments (RF) are then proximity-ligated to obtain a library of chimeric circular DNA. Paired-end sequencing and mapping of reads to a reference genome identifies interacting RFs and their frequency count. The data is conventionally stored as a pairwise read count matrix, RC, where RC i,j is the number of observed interactions (read-pair count) between genomic regions i and j. Despite the great sequencing depth of typical Hi-C experiments (200-500 million read pairs), RFresolution RC matrices are extremely sparse, with most RF pairs being observed either zero or one time. This sparsity makes observations of individual contacts between RF pairs inherently stochastic and unreliable. Increasing sequencing coverage is a partial solution, but without improved bioinformatics analyses, the depth of sequencing needed to make reliable estimates of RC i,j for individual RFs is unmanageable. For this reason, Hi-C data is rarely studied at RF resolution, but instead binned at fixed intervals (e.g., every 25 kb) to produce interaction frequency (IF) matrices. Unfortunately, reducing the resolution of a Hi-C IF matrix leads to difficulties in studying interactions between fine-scale genomic elements such as promoters and enhancers.
To improve the resolution of Hi-C data, recent protocols suggest digesting DNA more finely, either with a four-cutter RE (10,11) or DNAse I (12), followed by binning at 1 to 5 kb. While these methodologies increase the resolution of a Hi-C IF matrix, they actually worsen the problem of sparsity and stochastic noise. For example, using a 4-cutter RE instead of a 6-cutter results in a 16-fold increase in the number of RFs, and a 256-fold increase in RF pairs. This problem can be alleviated by using DNA capture technologies to concentrate sequencing on a predefined set of loci (13,14), but this approach loses the ability to interrogate the wholegenome conformation in a hypothesis-free manner. Instead, new bioinformatics approaches have been proposed to detect individual significant contacts at high resolution from Hi-C data (15,16), and a machine-learning method has been introduced to smooth Hi-C matrices at 10 kb resolution (17). Dynamic binning was also proposed as a way to adjust bin size to ensure even read coverage across the genome, enabling locally higher resolution (18). However, no approach currently exists to obtain complete and accurate IF matrices at RF resolution. Such an approach would be valuable as it would allow researchers to revisit existing datasets and get more information out of them without having to change experimental protocols or generate more experimental data. Here, we introduce the Hi-C Interaction Frequency Inference (HIFI) algorithms, a family of computational approaches that provide reliable estimates of IFs at RF resolution. HIFI algorithms reduce stochastic noise, while retaining the highestpossible resolution, by taking advantage of dependencies between neighboring RFs. We validate these algorithms via cross-validation and a comparison to observations made by independent chromosome conformation assays. We further demonstrate that HIFI greatly improves the detection of contacts between promoters and enhancers. Finally, we illustrate additional benefits of high-resolution Hi-C data analysis by using it to study how active regulatory regions are involved in structuring TADs and subTADs.

Results
Hi-C interaction frequency inference. HIFI algorithms aim to reliably estimate Hi-C contact frequencies between all intra-chromosomal pairs of restriction fragments. The output of a HIFI algorithm is an IF matrix, where each entry (i, j) corresponds to the IF of the RFs from row i and column j. As REs do not digest DNA uniformly along the genome, different rows/columns correspond to regions of different sizes. Depending on the RE used, the achievable resolution of Hi-C ranges on average from~400 bp (for a four-cutter such as MboI) to~3.7 kb (for a six-cutter such as HindIII). The highresolution analysis of Hi-C data faces multiple challenges, of which the sparsity of the observed read-pair data is the most significant. For example, a Hi-C experiment with a very high sequencing depth of one billion read-pairs will yield on average approximately 0.1 read-pairs per intrachromosomal matrix entry for a 6-cutter RE, and less than 0.001 for a 4-cutter RE. This sparsity results in the observed read-pair count for a given RF pair being a poor (high-variance) estimator of the true IF, except for rare RF pairs located in regions of the Hi-C contact map where IF values are extremely high. All existing solutions to this problem, including the methods introduced in this paper, take advantage of the fact that IF of neighboring entries in the IF matrix are strongly correlated. In particular, the most common approach to the resolution/accuracy trade off is to artificially reduce the resolution by binning the raw data to fixed-size intervals (e.g., 25 kb bins). This lower resolution increases the number of reads per bin pair, and thus allows for a more reliable estimation of IF, but at the cost of a loss in biological interpretability. Importantly, no unique bin size is uniformly ideal for an entire IF matrix. Portions of an IF matrix where high IFs are present could support a highresolution analysis, whereas others, corresponding to lower IF values, may require larger bins for accurate IF estimation. This is the key motivation behind the family of RF-resolution approaches presented here. More specifically, the problem addressed here is the following: consider a Hi-C dataset H produced with a given restric-tion enzyme e. For a given chromosome, the raw outcome is stored in an n × n intrachromosomal matrix RC, where n is the number of RFs produced by e, and RC i,j contains the number of read-pairs mapped to RF pair (i, j). Our goal is to estimate as accurately as possible the true RF-level interaction frequency matrix, IF true , which is the theoretical n × n IF matrix one would obtain if one were to sequence an infinitely large version of H to infinite depth. IF true is affected by a number of library, sequencing, and mapping biases that would need to be corrected in order to allow for proper biological interpretation; many such normalization techniques already exist for this task (19)(20)(21). Our goal here is not to improve upon these techniques, but to work upstream and provide the most accurate estimate of IF true . Four approaches are introduced and assessed (see Methods for details), each taking as input matrix RC and producing as output an estimate of IF true : 3. An Adaptive Kernel Density Estimation (HIFI-AKDE) approach, where the bandwidth is chosen dynamically for each matrix entry in order to ensure that a sufficient number of read-pairs is available for reliable IF estimation, while maximizing the resolution.
4. An approach based on Markov random fields (HIFI-MRF) where dependencies between neighboring cells are modeled and used to identify the maximum a posteriori estimate of IF true .
Assessing the accuracy of high-resolution IF inference algorithms is challenging because IF true is unknown, as Hi-C datasets of infinite sequencing depths are not achievable. Instead we consider two surrogates. First, we use a crossvalidation approach from existing Hi-C data. Second, we assess predictions against data produced by Chromosome Conformation Capture Carbon Copy (5C (22)), a targeted amplification protocol that achieves a much higher read count per RF pair compared to Hi-C.
Cross-validation of HIFI algorithms. We used crossvalidation to assess the accuracy of HIFI algorithms genomewide. Here, a Hi-C read-pair dataset of high sequencing depth produced by Rao et al. (2014) from GM12878 cells using HindIII was first filtered to retain only high-confidence intra-chromosomal read pairs, and then randomly partitioned into an input set (containing 80% of the set of filtered read pairs, or 607,587,043 read pairs), and a test set (20%, or 151,979,454 read-pairs) (Fig. 1A). The input set is then further down-sampled into seven subsets ranging in size from 1 to 100% of the full input set. Mapping and tabulating readpairs at RF-level resolution yields a family of IF matrices: RC input_1 , RC input_2 , . . ., RC input_100 , and RC test . Each of the four inference algorithms are evaluated by their application to each of the down-sampled input matrices to obtain a predicted IF matrix, IF pred , which is then compared to the test matrix RC test to obtain the sum of squared errors: Although RC test is not equal to IF true , the inference approach that minimizes SSE(IF pred , RC test ) is also the one that minimizes SSE(IF pred , IF true ), and hence, this serves as a valid basis for comparison. Fig. 1B shows that the accuracy (SSE) of fixed-binning strategies improves with input set size and that the optimal accuracy is obtained at different bin sizes for different input set sizes: large bins are ideal for low-coverage training data, whereas smaller bins are better with high-coverage data. More importantly, the fact that read-pairs are highly nonuniformly distributed in RC matrices means that the ideal bin size differs depending on the local RC density. In particular, short-range contacts, which typically have higher RC values, can support high-resolution analyses (smaller bins), but those at longer ranges are best estimated with larger bins (Fig. 1C). The HIFI-KDE approach with a fixed bandwidth generally obtains better results (Suppl. Fig. 1A, B), but suffers from the same type of problem, where optimal results are obtained with large bandwidth values for low-coverage datasets and lower bandwidth values for high-coverage. The HIFI-AKDE approach, where different bandwidth values are chosen at each cell based on the surrounding signal density, clearly outperform the first two approaches (Fig. 1D), with optimal performance obtained using a M inimumCount value of 100 (see Methods) throughout various coverage levels. HIFI-MRF performs the best overall ( Fig. 1D and Suppl. Fig.  1C, 1D), except at extremely low sequencing depths (i.e., 6-12M read-pairs). Indeed, for typical sequencing depths (100-250M read-pairs), HIFI-MRF improves IF estimation accuracy over the entire range of genomic distances (Fig. 1E, Suppl. Fig. 2), producing estimates that are 5-40% more accurate than those obtained by fixed-binning approaches and 5% more accurate than HIFI-KDE and HIFI-AKDE. HIFI approaches also produce estimates that are more accurate than those of HiCPlus (17), a machine-learning technique for high-resolution analysis of Hi-C data, especially at shortrange distances (see Fig. 1D and 1E, Suppl. Fig. 1D and 3, and Methods).
Validation against 5C data. 5C has been used to study the conformation of moderate-size genomic regions (100 kb -5 Mb), including the beta-globin locus (22,23), the HOX clusters (8, 24, 25), the CF T R locus (26, 27) and the Xist locus (7). 5C allows for a high sequencing depth measure-ment of the IF of each RF pair within given genomic regions, which improves the accuracy of RF-level IF estimates. As such, 5C data constitutes an excellent benchmark to compare different inference approaches. We analyzed data from two cell types for which both 5C and Hi-C data are available: (i) a 4 Mb region around the Xist gene ( Fig Hi-C data from Rao et al., 2014). In the GM12878 dataset, which has higher Hi-C sequencing depth (~760M mapped read-pairs genome-wide), the correlation between raw Hi-C and 5C data is moderate (Spearman ρ s = 0.45; Suppl. Fig.  4C), but it is improved by the application of HIFI-MRF (ρ s = 0.71; Suppl. Fig. 4D, E). In the mESC dataset, with lower Hi-C sequencing coverage (~122M read-pairs), the correlation of raw 5C against raw Hi-C data is relatively weak (ρ s = 0.27; Fig. 2C), but improves to nearly the same level as the first dataset from the application of HIFI-MRF (ρ s = 0.69; Fig. 2D, 2E). In both cases, the intricate structure of TADs, as well as some of the finer looping events become apparent in the HIFI-MRF-processed Hi-C data ( Fig. 2D and Suppl. Fig. 4D). Indeed, the application of HIFI-MRF to Hi-C data allows for the detection of regulatory contacts that could previously only be observed using 5C. For example, Nora et al. (2012) used 5C to observe a long-range interaction between T six and its transcriptional regulator -a large intervening noncoding RNA called "Linx" -occurring in female mice as a component of X-inactivation. This interaction is very clearly observed in the HIFI-MRF-processed Hi-C data (Dixon et al., 2012), whereas it is difficult to distinguish from background in raw or binned Hi-C data ( Fig. 3A and B). These results demonstrate that HIFI-MRF can be used to analyze existing Hi-C data sets and potentially lead to novel discoveries at finer genomic scales.
Validation against externally-predicted chromatin contacts. To more fully assess the extent to which HIFI-MRFprocessed Hi-C data can be used to identify biologically relevant contacts, we asked whether it can also confirm chromatin interactions found through alternative approaches. Specifically, we considered a set of contacts identified by Chromatin Interaction Analysis with Paired-End Tag Sequencing (ChIA-PET (28)) in GM12878 cells, bound either by CTCF (92,114 contacts (29)), RNA Polymerase II (PolII -192,394 (29)), or RAD21 (38,952 (30)). We also considered a set of computationally inferred contacts identified by correlation of DNAse I hypersensitivity signals across multiple cell types (31). For each set of contacts, a set of negative (control) fragment pairs were chosen by randomly repairing the same RFs. We then measured, for each range of genomic distance, the extent to which positive contacts could be distinguished from negative contacts on the basis of normalized HIFI-MRF Hi-C data, by measuring the Area Under the Receiver Operating Characteristic curve (AUROC) of a univariate predictor using the RF pair's inferred IF value

Fig. 1. Cross-validation of fixed-binning and HIFI methodologies.
A) Schematic representation of cross-validation methodology to assess the accuracy of fixed-binning and proposed HIFI methodologies. B) Cross-validation error for fixed-binning approaches, for different bin sizes, as a function of coverage. See also Suppl. Fig. 1 for similar analyses for HIFI-KDE and HIFI-AKDE. C) Analysis of fixed-binning error (relative to error with 1 RF per bin) across genomic distance between RF-pairs. No singular bin size performs best for all genomic distances. D) Comparison of errors for different approaches. For fixed binning and HIFI-KDE, the optimal bin size or bandwidth was chosen separately for each coverage level. Nonetheless, HIFI-MRF outperforms all other approaches, including HiCPlus (17). E) Comparison of errors (relative to error obtained with fixed binning using 16 RF per bin) by genomic distance of RF pairs, using as input a set of 304M read pairs (50% of total training set). HIFI-MRF performs best across all distances.  (7).
Note the use of true-size heatmaps, where the height (resp. width) of a row (resp. column) is proportional to the size of the RF it represents. B) Raw, RF-resolution Hi-C data for the same region (6). C) Correlation of 5C and raw Hi-C data at RF resolution (Spearman ρs = 0.27, two-sided Student's t-test p-value< 10 −16 ). D) IF matrix estimated by HIFI-MRF from the same Hi-C data. Observe the similarity to the 5C data in (A). E) Correlation of 5C and HIFI-MRF processed Hi-C data at RF resolution (Spearman ρs = 0.69, p-value< 10 −16 ).
as a predictive variable. Higher AUROC values indicate improved ability to distinguish positive from negative contacts. We observe that HIFI-MRF-processed Hi-C data allows significantly better detection of validated contacts compared to fixed-binning approaches, for all four datasets, across all genomic distance ranges, and both at low (~61M read-pairs obtained by down-sampling; Fig. 4A-C and Suppl. Fig. 5A-C, 6A and 6B), and high (~608M read-pairs; Fig. 4D-F, Suppl. Fig. 5D-F, 6C, and 6D) sequencing depth. Notably, the ability to distinguish positive from negative ChIA-PET contacts is relatively poor at short distances (<50 kb) because nearly all pairs have very high IF values, but improves considerably at longer range (300-500 kb). In contrast, contacts inferred based on DHS correlations are more difficult to identify overall (AUROC<0.6), becoming increasing so at longer ranges. We speculate that this loss in detection power may be due to an increased error rate present in this benchmark dataset.
Remarkably, the application of HIFI-MRF to low-coverage Hi-C data yields predictive power that is nearly as good as in the high-coverage dataset (compare panels Fig. 4A-C to D-F), suggesting that HIFI-MRF is able to identify functional contacts even in Hi-C data of moderate depth. Fig. 4 also includes results for HiCPlus (17) and HMRFBayes (15), an approach for the detection of significant contacts at RF resolution (see Methods). Overall, HIFI-MRF clearly outperforms these two approaches, although HMRFBayes performs nearly equally well for some low-coverage data sets ( Fig. 4A-C). The advantage of HIFI-MRF is particularly noticeable at short-to medium-range distances (<200 kb). Taken together, these results show that using HIFI-MRF to process Hi-C data improves the ability to delineate individual chromatin contacts.
HIFI allows new insight into fine-level genome organization. The high accuracy and resolution afforded by HIFI enables researchers to answer questions that are difficult to address with lower-resolution analyses of Hi-C data. Here, we illustrate one such application: the high-resolution analysis of TAD and subTAD boundaries. We used a modified directionality index (DI) score, originally introduced by Dixon et al. (2012; see Methods), to identify 5,000 TAD boundaries in the HindIII-GM12878 Hi-C data. Boundary predictions were performed at two resolutions: (i) RF-resolution using HIFI-MRF-processed data (3.7 kb on average; Fig. 5A, top heatmap) and (ii) classical fixed-binning approach (16 RF ≈ 50 kb per bin; Fig. 5A, bottom heatmap). Using ENCODE ChIP-seq datasets (34), we quantified the occupancy of DNAbinding proteins relative to TAD boundaries. Consistent with previously reported observations and models (35)(36)(37), CTCF (Fig. 5B) showed a remarkable enrichment immediately outside of these boundaries, with sites on the plus strand sharply peaking at upstream TAD boundaries and those on the minus strand peaking at downstream boundaries. Similar enrichments at TAD boundaries are observed for RAD21, SMC3 (Cohesin complex), and ZNF143 (Suppl. Fig. 7), consistent with previous reports (6,(38)(39)(40)(41). Although the same phenomenon is visible in fixed-binning data, the peaks are much sharper (narrower and higher) in HIFI-MRF data, indication that RF-resolution allows more accurate calls of TAD boundaries.
We next studied the role of TAD boundaries in gene regulation, by looking at the distribution of active regulatory regions, as annotated by ChromHMM (42) based on celltype-specific histone marks and DNA accessibility data. We observe a moderate enrichment for active promoters immediately outside TAD boundaries (only visible in HIFI-MRF processed data) and for strong enhancers within TADs. This trend is partially reflected in the occupancy profiles of several transcription factors (Fig. 5D and Suppl. Fig. 8). Some transcription factors (in particular MEF2A, MEF2C, MTA3, NFIC, RELA, RUNX3, and SPI1) exhibit a gradual enrichment toward the middle of TADs, together with a small but well-defined, CTCF-like peak just outside TAD boundaries. Others (e.g., CHD1, CHD2, FOXM1, IRF4, PAX5, and PML) also show the same peak at TAD boundaries but little within-TAD enrichment (Suppl. Fig. 9). Notice that in many cases, the enrichment at TAD boundaries is only apparent based on HIFI-MRF data and would likely be missed using data binned at 50 kb resolution. We then repeated the analysis (HIFI-MRF followed by TAD boundary calls) on Hi-C data generated on the same cell line using the 4-cutter MboI restriction enzyme, with cut sites every 434 bp on average. The extremely high resolution of this dataset (Fig. 5E) provides opportunities to study fine structures such as subTADs (8,9), which are difficult to study at lower resolutions. We used the HIFI-MRF MboI-GM12878 data and the same modified DI approach to identify a set of 25,000 domain boundaries, of which approximately 2,500 matched a HindIII-GM12878 TAD boundary (within 25 kb). The remaining~22,500 boundaries are not detected in the HindIII data and likely correspond to subTAD boundaries. Repeating the occupancy analysis against subTAD boundaries, the same enrichment for convergent CTCF sites is observed (Fig. 5F), but a very different picture emerges with respect to regulatory regions. Most notably, active promoters, and to a lesser extent strong enhancers, have a clear tendency to occupy regions that lie immediately outside subTADs ( Fig  5G; see also example in Fig. 5E). Indeed, the density of active promoters is approximately 30 times higher in the 1 kb region that precedes a subTAD boundary than in the 1 kb region that follows one. A similar enrichment is found in inter-subTAD regions for FOXM1 and NFIC (Fig 5H), and nearly all tran-scription factors studied. These results are consistent with a model where active regulatory regions play a key role in partitioning TADs into subTADs.

Methods
HIFI: Fragment-specific bias calculation. Factors such as fragment size, GC content, and mappability affect the observed read count matrix RC. For each fragment i of chromosome c, we estimate this bias as where n reads is the total number of read pairs mapped to the chromosome c and n f ragments is the number of RF on that chromosome. Computed biases are used to obtain a normalized read count matrix nRC, where nRC i,j = following estimate of interaction frequency for RF pair (i, j): (HIFI-KDE). This approach follows the standard two-dimensional Kernel Density Estimation (KDE) procedure (43,44), where predicted IF pred for RF pair (i, j) is obtained as a weighted sum of the entries of RC surrounding (i, j), parameterized by bandwidth parameter h. Specifically, we set w(a, b; h) , where w(a, b; h) = e − a 2 +b 2 2h 2 √ 2πh 2 given h. Near the edges of the matrix, values of a and b such that indices (i + a, j + b) fall outside the matrix are excluded from the sums of both the numerator and denominator.

HIFI: Adaptive Kernel Density Estimation (HIFI-AKDE).
This approach is similar to the fixed KDE, except that the value of the bandwidth parameter h is chosen separately for each pair (i, j). Specifically, we choose h i,j to be the smallest value such that where 'M inimumCount' is a user-defined parameter. In other words, regions of the matrix that tend to have larger RC values are estimated using smaller bandwidths (i.e. higher resolution), whereas those that are more sparse use larger bandwidths. HIFI-AKDE results in a fine resolution in dense regions and a lower resolution in more sparse areas of the matrix. In order to speed up the computation of h i,j , we use a precomputed cumulative matrix, cumRC, where allowing to calculate cov i,j in constant time:

HIFI: Markov Random Field Estimation (HIFI-MRF).
A Markov Random Field (MRF) describes a set of random variables interconnected via a lattice of dependencies. Let us denote by IF MRF i,j the IF value we aim to estimate at position (i, j). We model dependencies between neighboring cells as where M edianN eighborhood i,j is the median of the eight IF MRF cells surrounding cell (i, j). We chose to model this dependency using the median instead of the mean of the neighbors because it allows for sharper transitions regions such as TAD boundaries. The value of σ 2 i,j is set to α · log (M edianN eighborhood i,j ), where α is a userdefined parameter empricially set to 0.2. We model the dependency between the observed read count RC i,j and the estimated true IF value IF M RF i,j using a Poisson distribution: We then seek the matrix IF MRF that maximizes P r[RC, We first initialize the IF MRF matrix using the output of the HIFI-AKDE algorithm. We then optimize IF MRF using a modified gradient descent approach. Each entry IF MRF i,j is revised so as to maximize the local joint likelihood. This process is repeated until convergence, which usually takes less than 5-10 iterations. Despite using of the median rather than the mean to model inter-cell dependencies, some bleed-in effect is observed at TAD boundaries. To prevent those, we designed an approach where the nRC matrix is first scanned to identify sharp horizontal or vertical transitions characteristic of TADs. Horizontal boundaries are defined by a row index i and a pair of column indices j and j , and will be set if the distribution of nRC values in nRC i,j...j differs significantly from that in nRC i+1,j...j , as determined by a Kolmogorov-Smirnov test. More precisely, boundaries are set greedily, starting with the most significant boundary matrix-wide, and iteratively adding more boundaries, provided they do not overlap previously set boundaries, until the KS statistic falls below a user-defined threshold (the value of 1.5 was used here). Vertical boundaries are obtained similarly. Boundaries are then used in the HIFI-MRF model to prevent certain neighbors from contributing to the neighborhood median of a given cell. Specifically, cells (i , j ) that sit on the opposite side of a boundary from cell (i, j) are excluded from the neighborhood of (i, j).
HIFI: Outputting normalized matrices. HIFI can either produce a normalized or non-normalized output. Normalized outputs are produced by the approaches described until now. Non-normalized outputs are obtained as IF i,j · bias i · bias j . In this manuscript, normalized outputs were used throughout, except for the cross-validation experiment.
HIFI: Implementation and availability. The HIFI package is available at https://github.com/BlanchetteLab/HIFI. It consists of a C++ program for IF estimation, together with Python scripts for input data formatting and the true-size IF matrix visualization.
HiCPlus. The source code for HiCPlus (17) was obtained from https://github.com/zhangyan32/HiCPlus. Models were trained on Hi-C data from chromosomes 1-8 at 10 kb resolution, within a range of 2 Mb, as recommended. Input and target contact frequencies were obtained from input set and test RC matrices, respectively. Models were provided 100 epochs (10 times more than recommended) to converge while ensuring overfitting did not occur. Model output was then transformed from 10 kb to RF resolution based on the number of nucleotides each RF contributed to a given 10 kb bin.

HMRFBayes.
A Java implementation of the Hidden Markov Random Field based Bayesian method (HMRFBayes (15)) was obtained from http://www.unc.edu/ xuzheng/HMRFHiCFast/tutorial.php. The HMRFBayes program was provided the observed and expected contact frequencies for paired restriction fragments within 1 Mb bins along chromosomes 9-X, where the expected contact frequency was calculated as follows:

otalReadP airInM atrix
Hi-C read-pair pre-processing. The publicly available Hi-C User Pipeline (HiCUP (45)) v0.5.3 was used to process raw sequencing reads. HiCUP-mapped reads to the human (hg19) genome are also filtered to remove expected artifacts resulting from the sonication and ligation steps (e.g., circularized reads, reads with dangling ends) of the Hi-C protocol. Mapped reads were further filtered for a Mapping Quality Score (MAQ) greater than 30 to ensure unique mappability (19). BAM/SAM-mapped read files were then converted to a sparse matrix TSV file format (by our 'BAM-toSparseMatrix.py' script) before use with HIFI.
Directionality index and TAD prediction. he directionality index (DI) was first described by Dixon J.R. et al. (2012) to detect directionality bias for interactions across a Hi-C IF matrix. The DI for a given RF is usually calculated as follows: A and B are the sum of all interactions within a window located either upstream (A) or downstream (B) of an RF (window size of 500 kb is used here). E is the expected number of reads under the null hypothesis (i.e., there is no interaction bias for the given RF). Due to the low coverage at RF-resolution Hi-C data, the DI formula yields very noisy predictions. We thus used the following modified version: This modification transforms terms present in the right parentheses to error rates and helps to scale the magnitude of the DI. TAD boundaries are defined as RFs whose DI' value is a local maximum or minimum in a window of 21 RFs (51 for MboI analyses) centered around it. In the case of the fixed binning (b=16) analysis, only RFs at the center of their bin are considered. Finally, due to their low coverage, regions within 2 Mb of a centromere or telomere were excluded.

ENCODE ChIP-seq peaks, ChromHMM, and CTCF
ChIP-seq signal data pre-processing. ChIP-seq data from ENCODE (34) and ChromHMM (42) predictions were downloaded from the UCSC Genome browser (46) and binned to HindIII and MboI RFs. For Fig. 5C and 5G, only states 1 and 4 were used (to reduce redundancy). CTCF motifs and orientation were identified in a similar manner to Fundenberg et al. (2016) using HOMER (47) and the 'CTCF_known1' PWM (48). CTCF ChIP-seq signal data was then parsed for the total signal value covered per motif and the retained sums were then binned by expected RFs.

Discussion and Conclusions
Hi-C has become a commonly used approach to map 3D chromatin organization genome-wide. Since its introduction in 2009, the method has been updated many times to improve upon accuracy and resolution, or to target specific types of contacts. However, to date, using Hi-C data to accurately and systematically identify fine-scale chromosome contacts remains challenging, mostly because the sequencing depth required to achieve high-resolution contact maps is too great. To overcome the sparsity of contact information and increase the signal-to-noise ratio, Hi-C data is traditionally binned at fixed intervals along chromosomes to produce lower-resolution matrices (10). This lower-resolution representation of Hi-C data limits its application in studies of genomic regulatory networks or mechanisms of disease, which require robust, high-resolution 3D genomics data.
Here, we introduced HIFI, a family of density estimation algorithms that allow for the observation of high-resolution (at the restriction-fragment scale) genomic contacts from Hi-C data of various sequencing depths. Our results show that HIFI algorithms, and in particular those based on Markov Random Fields (HIFI-MRF), provide highly accurate estimates of Hi-C interaction frequency at RF resolution, and outperforms classical fixed-binning approaches and previously published methods such as HiCPlus (17). We demonstrate that HIFI-MRF recapitulates contact data obtained by 5C and also captures interactions detected by ChIA-PET (Fig. 4) better than HiCPlus and HMRFBayes (15). Unlike the former, HIFI is easy to use and does not require special equipment (GPUs) to run within a reasonable timeframe. Our method also runs more than 10 times faster than the HMRFBayes. The high resolution and accuracy provided by HIFI allows analyses and discoveries that could not be made with lower-resolution Hi-C data. For example, HIFI allows for the identification of TAD boundaries at RF resolution, which provides a unique opportunity to finely delineate the role of different DNAbinding proteins. Benefiting from the RF-resolution achieved with HIFI-MRF, we show that CTCF, RAD21, SMC3, and ZNF143 are enriched just outside both TAD and subTAD boundaries and their sharp depletion within-TADs may be a major contributor to the formation of TAD boundaries (Fig.  5). In addition, we detail a set of transcription factors (based on ENCODE ChIP-seq data) that are found to be enriched at RFs labeled as TAD boundaries (Fig. 5B, C). Finally, we highlight the new observation that active enhancers and promoters appear to provide structure to TADs, whereby DNA located between consecutive active regulatory regions form subTADs. This is obviously just an illustration of insights that can be gained from the analysis of Hi-C at high resolution. Others would include the use of HIFI-processed Hi-C data to further dissect the mechanisms of genome organization, and to prioritize non-coding variants obtained from genome-wide association (GWAS) or expression quantitative trait loci (eQTL) studies, as is starting to be done with capture Hi-C data (49).
While HIFI provides a significant improvement over previous methodologies for handling Hi-C matrix sparsity, there remains several directions for possible improvements. First, HIFI is relatively slow, requiring roughly an hour per chromosome (at HindIII resolution), due to the size of the matrices analyzed and the complexity of MRF-based inference. Improved algorithms, multi-threading, and GPU-based computation are expected to provide significant speed-ups and are under development. These improvements will also allow the calculation of confidence intervals for estimated contacts frequencies, using Markov Chain Monte Carlo sampling. Machine-learning (ML) approaches, such as convolutional neural nets (CNN), offer an alternative to probabilistic approaches like HIFI-MRF. In recent work by Zhang et al. (2018), the authors showed that CNNs can be trained on Hi-C data to increase the resolution from 40 kb to 10 kb. Being model-free, ML approaches have the potential to discover and take advantage of unsuspected dependencies in the data. However, these models have yet to produce RF-resolution data and thus, remain limited in their ability to provide biological support as shown in this manuscript. In addition, be-ing intrinsically complex models, prediction errors may occur in unexpected manners.
In conclusion, the HIFI algorithms and software described in this manuscript allow for accurate, high-resolution analyses of 3D genome organization using Hi-C data. RF-resolution Hi-C data allows for the recapitulation of observations made by 5C, a better separation of positive and control/background contacts, RF-resolution TAD and subTAD boundary calling, and the identification of potential DNA-DNA contacts and TF enrichments that drive changes in chromatin architecture and gene regulation. By operating upstream of many Hi-C data analysis tools (e.g. loop, TAD, and compartment predictors as well as fragment-bias normalization), HIFI can easily be inserted in a number of Hi-C data analysis pipelines, and we believe that the research community will be quick to take advantage of this family of new algorithms. Area under the receiver-operator curve (AUROC) comparison for a univariate predictor applied to positive and negative contact populations for ChIA-PET RAD21 (28). Left and right columns represent the classifier applied to either a set of positive/negative RF contacts within 1 Mb bins along chromosomes 9-X or entire contact maps genome-wide, respectively. Top and bottom rows represent the performance of the classifier applied to Hi-C data of 60.8M (10%) and 608M (100%) input set size, respectively. HIFI-MRF is found to provide more accurate (based on AUROC) predictions of RF-pair classification (positive vs. negative) compared to other inference methods genome-wide, but within 1 Mb bins along chr9-X, a clear improvement is not observed over HiCPlus (17).