HSA: integrating multi-track Hi-C data for genome-scale reconstruction of 3D chromatin structure

Genome-wide 3C technologies (Hi-C) are being increasingly employed to study three-dimensional (3D) genome conformations. Existing computational approaches are unable to integrate accumulating data to facilitate studying 3D chromatin structure and function. We present HSA (http://ouyanglab.jax.org/hsa/), a flexible tool that jointly analyzes multiple contact maps to infer 3D chromatin structure at the genome scale. HSA globally searches the latent structure underlying different cleavage footprints. Its robustness and accuracy outperform or rival existing tools on extensive simulations and orthogonal experiment validations. Applying HSA to recent in situ Hi-C data, we found the 3D chromatin structures are highly conserved across various human cell types. Electronic supplementary material The online version of this article (doi:10.1186/s13059-016-0896-1) contains supplementary material, which is available to authorized users.


Background
Three-dimensional (3D) chromatin conformation plays crucial roles in diverse genome functions, such as transcriptional regulation [1], DNA methylation [2], replication [3] and cohesin binding [4]. Elucidating 3D chromatin conformation can provide a mechanistic understanding of various biological processes and human diseases. Therefore, it is important to capture 3D chromatin conformation and relate it to genome function. 3D chromatin conformation has traditionally been studied by cytogenic methods, such as florescent in situ hybridization (FISH) [5]. Recently, several experimental technologies have been developed to capture chromatin conformations at multiple scales. For instance, the chromosome conformation capture (3C) technique has been used to study chromatin structure in living cells [6]. It derives the circularized chromosome conformation capture (4C) [7], which is able to detect many genomic loci interacting with a DNA region of interest. It is further extended to carbon copy chromosome conformation capture (5C), which allows for largescale detection of 3D chromatin interactions [8]. Further, Hi-C was introduced to dissect 3D chromatin structure at the genome-scale [9]. Another technology, chromatin interaction analysis by paired-end tag sequencing (ChIA-PET), detects genome-wide chromatin interactions mediated by a protein of interest [10]. These technologies have generated large amounts and diverse types of data. To interpret these data appropriately and advance biological understanding, it is crucial to develop statistically sound approaches to their modeling and analysis.
Here, we focus on Hi-C for a genome-scale analysis of chromatin conformations. Hi-C data are usually summarized into a contact map, which reflects the physical proximity between pairs of genomic loci at the genome scale. In a Hi-C contact map, an off-diagonal entry represents the number of paired-end reads spanning two different loci. The complex steps of Hi-C experiments introduce various biases, such as restriction enzyme cutting, GC content and sequence uniqueness [11]. For instance, Hi-C employs different restriction enzymes, such as NcoI (recognizing CCATGG) and HindIII (recognizing AAGCTT), which results in different genomic cutting sites and, consequently, contact maps. Some existing efforts are normalizing Hi-C contact maps to reduce systematic biases buried in the Hi-C experiments, either parametrically [12] or non-parametrically [11].
One of the most important goals of a Hi-C data analysis is to reconstruct 3D chromatin structures of the genome. Elucidating the 3D chromatin structure of the genome is important as it improves the mechanistic understanding of various gene regulatory events that are orchestrated in the nucleus of living cells. Also, transforming contact maps to 3D chromatin structures can be regarded as a dimension-reduction (noise filtering) procedure, as the degrees of freedom reduce from O(N 2 ) to O(3N), where N is the number of genomic loci. The improvement is substantial, especially at the genome scale, as N is typically very large when many loci are involved.
A Hi-C experiment requires millions of cells. Therefore, chromatin interactions captured by Hi-C reflect the consensus structural conformation of the whole population of cells. Some existing computational efforts infer the consensus 3D chromatin structure. Some are based on optimization of target functions with pre-specified constraints [13], e.g., ChromSDE [14] (employing a semidefinite programming approach), ShRec3D [15] (combining shortest-path distance with multidimensional scaling) and others [16][17][18][19]. However, these optimization-based models may be trapped in local optima, particularly at low signal coverage (the percentage of non-zero entries in a contact map), and do not consider Hi-C experimental uncertainties. Statistical approaches have been developed to model the uncertainties in Hi-C experiments explicitly. For instance, MCMC5C [20] models Hi-C data through a Gaussian model. In this model, there are no bias removal steps, and the Gaussian variance estimate is ad hoc. To overcome these limitations, BACH [21] and PASTIS [22] employ Poisson models combining bias removal with 3D structure reconstruction. Due to limited availability of data, the reliability of these models remains to be tested when reconstructing 3D chromatin structure at the genome scale (for a more comprehensive review, see [23]). Importantly, all these existing approaches for 3D chromatin structure reconstruction are designed for singletrack Hi-C data from only one restriction enzyme. It is likely that one can obtain improved 3D models through integrative modeling of multi-track Hi-C data combining different restriction enzymes. Moreover, few existing methods consider the local dependence of neighboring loci, thus they are sensitive to the sparsity of Hi-C contact maps. In addition, none of the existing methods has been assessed on a wide range of independent experimental data. Finally, no approaches have been shown to give consistent performance at the genome scale across various cell types. In this paper, we propose a novel approach named HSA, to reconstruct 3D chromatin structures at the genome scale by leveraging multi-track Hi-C data and modeling the local dependence of neighboring loci explicitly. To our knowledge, this is the first approach integrating multi-track Hi-C data for 3D chromatin structure reconstruction at the genome scale. We assess HSA extensively through simulations and real applications on Hi-C data from four cell lines. We also apply HSA to a recent in situ Hi-C study of eight cell lines. We use orthogonal data sets from FISH and ChIA-PET experiments available for the cell lines as independent validations of the reconstructed 3D chromatin structures. The assessments demonstrate improved performance of HSA over a number of existing approaches across different cell lines at the genome scale. The study provides insights on the conservation of 3D chromatin structure across various human cell types.

Method overview
An overview of HSA is illustrated in Fig. 1. HSA takes one or more Hi-C contact maps of the same resolution as input to reconstruct a consensus 3D chromatin structure. It utilizes the generalized linear model (GLM) with an iterative algorithm, which combines Hamiltonian dynamics Fig. 1 Overview of HSA for 3D chromatin structure reconstruction from multi-track Hi-C data. HSA integrates multiple Hi-C contact maps from different restriction enzymes to reconstruct the underlying 3D chromatin structure. Color from blue to red represents chromosome position from the start to the end. 3D three-dimensional, GLM generalized linear model, SA simulated annealing with simulated annealing (SA), a global search strategy to explore the model space. It provides an option of Markov modeling when the contact maps have low signal coverage. The input for HSA can be either raw contact maps with count data or normalized contact maps obtained through existing approaches, such as Yaffe et al. [11]. The details of the HSA method are described in Section "Materials and methods".

Assessments and comparisons on simulated data
In the simulation, we compared HSA with seven published methods: BACH [21], ChromSDE [14], ShRec3D [15], MCMC5C [20], AutoChrom3D [24], PASITS [22] and TADbit [25]. We used contact maps simulated from a regular helical structure (see Section "Materials and methods" for detailed derivations). We applied the methods to the simulated contact maps (see Additional file 1 for detailed implementation of each method). To test the effect of sparsity in the contact maps on the accuracy of the methods, we simulated contact maps at three signal coverage levels: 90 %, 70 % and 25 %. The true structures and the fitted structures from the eight methods at the three signal coverage levels are shown in Additional file 1: Figure S1. We calculated the Pearson correlation coefficients (PCCs) and the root-mean-square deviations (RMSDs) between the true structures and the fitted structures for the eight methods (Fig. 2). Specifically, PCC measures the correlation between the real structure and predicted structure across the pairwise spatial distances among all loci. RMSD is calculated as the minimum root mean of squared distances between the 3D coordinates of each loci in the real and predicted structures (see Additional file 1 for detailed derivations). Clearly, HSA outperforms the others with the lowest RMSDs and the highest PCCs at all three signal coverage levels. To demonstrate the advantage of multi-track fitting uniquely implemented in HSA, we used HSA for joint modeling of the contact maps at 70 % and 25 % signal coverages. This multi-track fitting outperforms all its single-track counterparts and has even better performance than some methods at 90 % signal coverage. This suggests that combining information from multiple contact maps may improve the accuracy of 3D chromatin structure reconstruction. Another unique feature of HSA is the option of Markov modeling. To investigate the utility of this feature, we applied BACH, HSA without Markov modeling, and HSA with Markov modeling on a contact map with 10 % signal coverage. HSA with Markov modeling clearly outperforms the other two with the lowest Fig. 2 Assessment and comparison of 3D chromatin structure reconstruction methods for a regular helical structure. The fitted structures of the eight methods were compared to the true regular helical structure and a PCC and b RMSD were calculated. PCC Pearson correlation coefficient, RMSD root-mean-square deviation RMSD and the highest PCC (Additional file 1: Figure S2). We also assessed the eight methods using contact maps simulated from a random-walk structure at 30 % signal coverage levels (see Section "Materials and methods" for detailed derivations). Again, HSA outperforms all the others with the lowest RMSDs and the highest PCCs (Table 1 and Additional file 1: Figure S3). We found that HSA with Markov modeling becomes increasingly important as the signal coverage goes down from 30 % to 10 % (Additional file 1: Table S1 and Additional file 1: Figure S4). Based on these simulation results, we suggest Markov modeling in HSA when the signal coverages in contact maps are less than 10 %.
The above simulations are based on the consensus structure assumption, i.e., there is only one true structure underlying a contact map. Among the tested methods, BACH, MCMC5C and TADbit are based on the assumption that there is an ensemble of structures underlying a contact map. To compare methods when the true structures are not unique, we employed the toy models used in developing the TADbit method [26]. In each toy model, a contact map was simulated from multiple structures with a certain noise level and structural variation. We applied BACH, MCMC5C and HSA to the contact maps of the toy models. We also extracted the structure predicted by TADbit based on the lowest integrative modeling platform objective function model for each contact map [26]. We then compared the predicted structure of each method to all the underlying structures of each toy model. We calculated the PCCs and RMSDs of each method at each combination of noise levels and structural variations in the toy models. As seen in Figs. 3 and 4, although HSA is a consensus-structure-based model, its performance is comparable to TADbit and better than BACH and MCMC5C on the toy models based on the ensemble structure assumption.

Application to Hi-C data of four cell lines
We applied HSA to the Hi-C data of four cell lines: mESC [27], GM06990 [9], K562 [9] and MCF7 [28]. mESC and GM06990 have Hi-C contact maps of both NcoI and HindIII, while K562 and MCF7 have those of HindIII only. To demonstrate the advantage of integrating information from multi-track Hi-C contact maps, we fitted HSA for both multi-track and single-track. For comparison, we also applied BACH [21], ChromSDE [14] and ShRec3D [15] to the same data sets. We fitted HSA and other models using the same inputs. Specifically, we fitted HSA and BACH using the raw contact maps with enzyme cut fragment length, GC content and mappability as covariates for bias correction. We also fitted HSA, ChromSDE and ShRec3D using the normalized contact maps processed by the same pipeline [11], since the latter two do not have an internal bias correction process.
Multi-track and single-track fittings of HSA result in consistent 3D structures, as shown in Fig. 5. At 200kilobase (kb) resolution, the 3D structures of the entire chromosome 14 of GM06990 show relatively smaller differences when fitted by HSA on NcoI and HindIII contact maps jointly, NcoI contact map only, and HindIII contact map only (Fig. 5a-c). The lowest value of pairwise PCCs between the three structures is 0.76. The 3D structures fitted by BACH on NcoI and HindIII contact maps exhibit a larger difference (PCC = 0.53) with some notable outlier loci (Fig. 5d, e). ShRec3D-derived 3D structures from NcoI and HindIII contact maps have the largest difference ( Fig. 5f, g, PCC = 0.37). ChromSDE was computationally overburdened on the contact maps at 200-kb resolution when tested on our computer cluster (Bright Cluster Manager v5.2, CentOS 6.0, 128 GB of RAM per system board).
We then reconstructed the 3D structure of each chromosome at 1-Mb resolution to compare the four methods across all four cell lines. To measure how well the fitted 3D structures explain the input contact maps, we transformed the pairwise distances in 3D into a fitted contact map by the power-law relationship F ij ∼ d α ij , where F ij is the (i, j) entry of the fitted contact map, d ij is the 3D distance and α is the track-specific power-law coefficient. Specifically, we estimated α as β c1 by the GLM framework of HSA. For BACH and ChromSDE, α was estimated by their respective models. For ShRec3D, we estimated α by fitting a GLM between the normalized contact maps and ln(d ij ) with the Poisson link function. Then, we calculated the PCCs between the input and fitted contact maps for all chromosomes.
As shown in Fig. 6, HSA-derived 3D structures fit the input Hi-C data better than the other three methods across all four cell lines in most cases. Notably, HSA fits equally well on both raw and normalized contact maps, while all the other three methods only work on one input data type. For normalized contact maps, HSA fits are clearly better than ChromSDE and ShRec3D (Fig. 6b, d, f and h), which is likely due to the superiority of the global search strategy of SA employed in HSA. The increase of goodness-of-fit is the largest for MCF7 (Fig. 6h), in which the contact maps are very sparse (5-13 % signal coverages across chromosomes).
Moreover, multi-track and single-track fittings of Hi-C data by HSA result in consistent goodness-of-fit, which indicates the robustness of HSA in identifying the same underlying 3D structure probed by different restriction enzymes. We found that the 3D structures derived by multi-track HSA fitting explain the contact maps of NcoI and HindIII equally well. This lies in the ability  of HSA to calculate track-specific power-law coefficients for distance transformation when fitting multitrack Hi-C contact maps. As shown in Fig. 7, HSA derives different power-law coefficients for different contact maps, in which NcoI has a smaller powerlaw coefficient than HindIII does across all chromosomes in GM06990. This indicates that simply pooling different contact maps together is suboptimal, as the discrepancy in power-law coefficients breaks the additivity. Also note that the power-law coefficient has a high variability among chromosomes, which suggests that it might be inappropriate to assume a universal power-law coefficient for 3D reconstruction across all chromosomes.
The high correlations between predicted structures and input contact maps indicate our model can explain well the Hi-C data. But it is not a measure of the method accuracy per se. In the following sections, we sought to use orthogonal data, such as those from FISH and ChIA-PET, to validate our predictions with Hi-C data.

Validations and comparisons using FISH data
We validated the 3D chromatin structures reconstructed from Hi-C contact maps with independent FISH data available for the cell lines mESC and GM06990. In mESC, FISH probes span a 32-Mb region on chromosome 2 and a 65-Mb region in chromosome 11 at 40-kb resolution [29]. We applied all four methods except for ChromSDE (which ran out of memory) on Hi-C contact maps at 40kb resolution. HSA was fitted using the raw contact maps of NcoI and HindIII jointly, those of NcoI only, and those of HindIII only. BACH was fitted using the raw contact maps of NcoI only and those of HindIII only. ShRec3D was fitted using the normalized contact maps of HindIII. We then calculated the PCCs between the predicted distances based on the 3D structures and the corresponding FISH-measured distances between the probed loci pairs. Each FISH locus overlaps with two binned loci in the Hi-C contact maps. So we tried different combinations (e.g., left-left, left-right, etc.) of the two bins at both ends of the FISH-probed loci pair when calculating the predicted distances and obtained a range of PCCs for each FISH data set. The PCCs of multi-track fitting of HSA are most robust (Fig. 8a) and significantly higher than those of single-track HSA on NcoI, BACH on NcoI, and ShRec3D on HindIII (p < 0.02 under a right-tailed T-test, Additional file 1: Table S2). This was marginally significant when comparing the PCCs of multi-track HSA with those HSA-multi HSA using multiple raw contact maps as the input, HSA-multinorm/singlenorm HSA using multiple normalized contact maps/a single normalized contact map as the input, HSA-single HSA using a single raw contact map as the input, PCC Pearson correlation coefficient of single-track HSA on HindIII (p = 0.0619) or BACH on HindIII (p = 0.0853), in which the former is mainly due to an outlier while the latter has evidently larger variance (Fig. 8a). In GM06990, FISH probes span chromosomes 14 and 22 at 200-kb resolution [9]. We were able to reconstruct the 3D structures of the entire chromosomes using HSA, BACH and ShRec3D using 200-kb resolution Hi-C maps, while ChromSDE ran out of memory. Again, HSA is more robust and accurate compared to BACH and ShRec3D, in which the PCCs of multi-track HSA are significantly higher than those of the other six approaches (p ≤ 0.0003 under a right-tailed T-test, Additional file 1: Table S3).

Validations and comparisons using ChIA-PET data
We further validated the 3D chromatin structures using publicly available ChIA-PET data of RNA PolII in mESC [30], K562 [31] and MCF7 [31]. These ChIA-PET data provide genome-wide chromatin interactions mediated by RNA PolII. We reasoned that the 3D distances between loci pairs with ChIA-PET interactions (loops) are smaller than those of non-interacting pairs (non-loops) among the RNA PolII anchors. So we extracted all genomic loci involved in interactions detected by ChIA-PET for each cell line, and divided all possible loci pairs into two groups depending on whether they were involved in ChIA-PET detected interactions. The predicted spatial distance between each loci pair was calculated based on the reconstructed 3D chromatin structures. Indeed, we found dramatic difference between loops and non-loops in HSA-derived 3D structures (Fig. 9) in all cell lines tested. RNA PolII mediated loops are significantly closer to each other than non-loops are in 3D (p = 0). Moreover, the difference between the two groups is relatively larger in the 3D chromatin structures reconstructed by HSA, compared to those of BACH, ChromSDE and ShRec3D. The increased performance is especially remarkable on

Application to in situ Hi-C data of eight cell lines
We further applied HSA to the contact maps of seven human cell lines (GM12878, HMEC, HUVEC, IMR90, K562, KBM7 and NHEK) and one mouse cell line (CH12-LX) from a recent in situ Hi-C study [32]. We fitted HSA based on the contact maps of 1-Mb, 100-kb and 25-kb resolution. All fitted contact maps correlate well with the input contact maps at all three resolutions (Additional file 1: Figure S5). To investigate the similarities of the 3D chromatin conformations of different cell types, we overlaid the fitted structures of the seven human cell lines at 1-Mb (Fig. 10), 100-kb (Additional file 1: Figure S6) and 25-kb (Additional file 1: Figure S7) resolution. Strikingly, at all three resolutions, these diverse sets of human cell types display similar global conformations. We further investigated the consistency of the local regions of the fitted structures across the seven human cell lines. Specifically, for each genomic locus and its neighboring 20 loci, we calculated the PCCs and Spearman correlation coefficients between any pair of the local structures of the seven human cell lines within that neighborhood region. We found that over 70 % of genomic loci at 25-kb resolution have PCCs or Spearman correlation coefficients ≥0.7 across all pairwise comparisons of the seven human cell lines, and the percentage goes to more than 90 % at 100-kb resolution (Additional file 1: Figure S8). This suggests that the genome conformations of diverse cell types are conserved, as revealed by the fitted 3D chromatin structures.

Conclusions
We have developed HSA -a novel method for improved chromatin structure reconstruction at the genome scale. Its joint modeling framework has the advantage of combining information from multi-track Hi-C contact maps of different restriction enzymes. The underlying chromatin structure is characterized by a GLM with Markov modeling. HSA searches the model space through an iterative algorithm combining SA with Hamiltonian dynamics, allowing efficient global model exploration.
The proposed method can handle diverse types of inputs of Hi-C data, including both normalized and unnormalized contact maps. It is especially effective for sparse contact maps, which are very common for Hi-C data. It models the local dependence of neighboring loci explicitly by Markov chains. The algorithm showed Fig. 10 Overlay of the 3D conformations of all chromosomes at 1-Mb resolution inferred from in situ Hi-C data for the seven human cell lines substantial improvement when the Hi-C contact map is sparse (say, 10 % signal coverage).
We tested our method through extensive simulations with known underlying structures. We found that our method is more accurate and robust than or comparable to existing methods at various signal coverage levels. We demonstrated that the performance on sparse contact maps is significantly improved by multi-track fitting and Markov modeling.
We applied the proposed HSA method to Hi-C data sets of diverse cell types from humans and mice. We found that our model fits the data better than a number of existing methods. We also employed orthogonal FISH and ChIA-PET data as independent validation of our reconstructed 3D structures. We demonstrated that our method outperforms a number of existing approaches across various cell lines. Importantly, the application of HSA to in situ Hi-C data reveals striking consistency across different human cell types, which suggests there are certain invariant 3D conformations of the genome, despite the dynamic temporal and spatial variations. This finding complements the well-known conservation of the topologically associated domains of the genome [27]. Our study points to two potential directions for further exploration. First, it will be interesting to study the motion of the chromatin as a polymer to understand why and how it generates different 3D conformations across cell types while maintaining a certain invariant topology. Second, our multi-track modeling can be extended to analyze different cell types jointly to extract the principal rules underlying 3D genome folding.
The application of HSA to all chromosomes of a dozen human and mouse cell lines demonstrates the feasibility of genome-scale 3D chromatin structure reconstruction. The running time of HSA remains reasonable up to 25-kb resolution for in situ Hi-C data (∼2000 loci per chromosome). In general, the running time of HSA increases by an order of O(N 2 ) with N as the number of loci, and by an order of O(C) with C as the number of tracks (Additional file 1: Table S4).
Chromatin conformation is known to play essential roles in genome function. High-throughput technologies such as Hi-C are generating genome-scale data sets for dissecting chromatin conformations in various biological conditions. The demonstrated ability of our method applied to diverse organisms and cellular conditions will deepen our understanding of 3D chromatin structure as the basis for regulating cellular functions.

Materials and methods
The HSA algorithm Given a specific region in a genome of interest, we assume there are C contact maps (C ≥ 1) treated by C restriction enzymes. The available loci in the cth track of the maps are defined as those containing the corresponding 6-mer sequence recognized by the cth restriction enzyme with mappability score over a certain cutoff [11].
If the inputs are the raw contact maps containing the counts of paired-end reads for any two loci, we model the read counts between any two available loci of the cth track using a GLM as below: where n i c j c and d i c j c represent the contact frequency and 3D distance between loci i c and j c , respectively. S i c indicates the 3D coordinate of the underlying locus i c . β c1 ln(d i c j c ) reflects the power-law relationship between contact frequency and 3D distance [8] where β c1 is the power-law coefficient. x i c j c k is the kth covariate for bias correction. Following Hu et al. [12], we include enzyme-cutting fragment length, GC content, and mappability as the covariates. The corresponding regression coefficients are denoted by β c0 , β c1 and β ck . Since restriction enzymes have varied cutting sites across the genome, the proposed joint modeling of multiple Hi-C tracks is able to cover more genomic loci by considering the union of available loci from all tracks. For simplicity, we note the coordinates of the ith locus after the union as S i (i ∈ ∪ C c=1 G(c)). When counts are the random variables of interest, Poisson and negative binomial models are the two commonly used approaches. For single-track Hi-C data, existing research has shown that Poisson regression and negative binomial regression have similar performance [12,22]. Thus, for each track c, we employ a Poisson regression model to characterize the counts of sequencing reads for simplicity.
Genomic loci of local proximity have innate correlations as connected residues in a polymer. We characterize the adjacency relationship of neighboring loci by a Gaussian Markov chain hidden in the contact maps to capture the local dependence of genomic loci: where A and b are the coefficients that characterize the transition of coordinates between loci i − 1 and i. is the covariance matrix. The parameters A and b can be chosen empirically to reflect the polymer's helix tendency as a priori information [33]. For simplicity, we set A as an identity matrix I 3 , b as a zero vector, and = (1/λ)I 3 (λ ≥ 0). Denote the union of the genomic loci ∪ C c=1 G(c) as {l 1 , l 2 , . . . , l n } with l 1 = 1 and l i < l i+1 , where n is the total number of loci. Given S l 1 , S l 2 , . . . , S l n , the log-likelihood of our model is: where the first term is the conventional log-likelihood of the GLM under a Poisson link function, the second term is a constant, and the last term reflects a distance penalty. The distance penalty controls the smoothness of the coordinates of neighboring loci with a tuning parameter λ. At the extreme scenario λ = 0, it corresponds to a GLM entirely relying on the contact maps without smoothing. Smoothing is necessary when the contact map is sparse. We set λ = O( √ n) during parameter initialization and λ = 1 in iterations when the density of the contact map is under 10 %, and λ = 0 in other cases.
When the input data are normalized contact maps obtained through a certain bias correction approach such as [11], we replace n i c j c c by the corresponding normalized intensity in the log-likelihood function without bias correction terms.
Parameters in HSA are estimated through an iterative algorithm. We first fit the GLM without the distance power-law term (β c1 ln(d i c j c )) to initialize β ck , k = 1. Then, we sequentially optimize the coordinate S l i based on the locations of its previous min(5, i) loci under the log-likelihood with Markov property to get an initial structure. We then use SA combined with Hamiltonian dynamics to explore the model space under the GLM, and update all coefficients iteratively. SA is a probabilistic method for locating a good approximation to the global optimum in a high-dimensional search space. It has been a popular tool for molecular structure prediction [34,35]. The use of SA with Hamiltonian dynamics allows the efficient global exploration of the model space.
HSA is open-source software available from http:// ouyanglab.jax.org/hsa/. The source code and user manual of HSA are provided under the GNU General Public License (GPL) at http://dx.doi.org/10.5281/zenodo.45514.

Simulated contact maps
We simulated contact maps based on the regular helical structure and the random-walk structure. We generated the (i, j) entry of a contact map as a Poisson-distributed random number n ij . The parameter λ ij of the Poisson distribution is based on the power-law conversion of the distance matrices of the real structures λ ij = c/d α ij . We set α = 1.5 and tuned c to make the signal coverage (the percentage of non-zero entries in a contact map) at 90 %, 70 % and 25 %. According to [14], we simulated uniformly distributed random numbers in (0, 1) for the covariates, including enzyme-cutting fragment length, GC content and mappability, and used them as input for BACH and HSA. For the comparison at 10 % signal coverage, we simulated the λ ij as ij with x ijk = x ik x jk , where x ik (k = 1, 2, 3) were uniformly distributed in (0, 1). We simulated the contact maps at 90 %, 70 % and 25 % signal coverage levels from the regular helical structure specified as: We also simulated the contact maps at 10 % signal coverage levels from the regular helical structure specified as: Finally, we simulated the contact maps for the randomwalk structure as Poisson-distributed random numbers. The Poisson distribution parameter ij and x ijk = x ik x jk , where x ik (k = 1, 2, 3) were uniformly distributed in (0, 1). All the simulated contact maps and structures are available at http://ouyanglab.jax.org/hsa/.

The toy models
The toy models are the very six toy chromosomes constructed by Trussart et al. [26]. The contact maps and the underlying structures were downloaded from http://sgt. cnag.cat/3dg/datasets/. The 3D structures reconstructed by TADbit were obtained directly from the aforementioned website. We applied HSA, BACH and MCMC5C to all the 168 contact maps to obtain their respective reconstructed structures. Denotations of the noise and structural variation levels of the contact maps were kept the same as in the aforementioned website. Specifically, alpha denotes the simulated experimental noise level, whose value is related to the decay of the Gaussian function [26] between the probability of interactions and the 3D Euclidean distances. A set represents the structural variation level. The nth set was generated by extracting 100 conformations separated by a time step of 10 n iterations in the simulation [26].

Hi-C data
The Hi-C data used in our study are from four cell lines: mESC, GM06990, K562 and MCF7. A description of each of these follows: mESC: The mapped reads are accessible at Gene Expression Omnibus (GEO) under the accession number GSE35156. Raw and normalized contact maps at 40-kb resolution [27] were downloaded from [36]. Raw and normalized contact maps at 1-Mb resolution were obtained from [14]. The normalized contact maps were all processed by the approach of [11].
GM06990: The mapped reads are accessible at GEO under the accession number GSE18199. The normalized contact maps at 1-Mb resolution [9] were downloaded from [37]. We used the pipeline of [11] to obtain normalized contact maps at 200-kb resolution.
K562: The mapped reads are accessible at GEO under the accession number GSE18199. We used the pipeline [11] for normalization.
MCF7: The raw data are from [28]. We used the pipeline [11] for normalization.
All raw contact maps were modeled with covariates including enzyme cut fragment length, GC content and mappability calculated according to [12].

FISH data
We obtained the published FISH data sets in mESC [29] from [21]. FISH data in GM06990 [9] were downloaded from GEO under the accession number GSE18199. The average inter-locus distances were used as the reference distance between loci. Different structures were scaled as done in [21]. Specifically, suppose we have p structures and M FISH measured distances: dist i · δ ik , i = 1, . . . , M δ ij = 1if loci pair i is in structure k, and 0 otherwise.
We performed this linear regression without an intercept and used the estimated β k to scale the kth structure. The FISH distances and predicted structures and contact maps are available at http://ouyanglab.jax.org/hsa/ and http://dx.doi.org/10.5281/zenodo.45513.

ChIA-PET data
We obtained the published RNA PolII ChIA-PET data sets in mESC from [30], in K562 from [31] and in MCF7 from [31]. RNA PolII mediated loops within 10 Mb in genomic distance were used as the benchmark. In the ChIA-PET data sets of MCF7 and K562, the original RNA PolII anchors were annotated according to the hg19 reference genome. To make them compatible with the