 METHOD
 Open Access
 Published:
HSA: integrating multitrack HiC data for genomescale reconstruction of 3D chromatin structure
Genome Biology volume 17, Article number: 40 (2016)
Abstract
Genomewide 3C technologies (HiC) are being increasingly employed to study threedimensional (3D) genome conformations. Existing computational approaches are unable to integrate accumulating data to facilitate studying 3D chromatin structure and function. We present HSA (http://ouyanglab.jax.org/hsa/), a flexible tool that jointly analyzes multiple contact maps to infer 3D chromatin structure at the genome scale. HSA globally searches the latent structure underlying different cleavage footprints. Its robustness and accuracy outperform or rival existing tools on extensive simulations and orthogonal experiment validations. Applying HSA to recent in situ HiC data, we found the 3D chromatin structures are highly conserved across various human cell types.
Background
Threedimensional (3D) chromatin conformation plays crucial roles in diverse genome functions, such as transcriptional regulation [1], DNA methylation [2], replication [3] and cohesin binding [4]. Elucidating 3D chromatin conformation can provide a mechanistic understanding of various biological processes and human diseases. Therefore, it is important to capture 3D chromatin conformation and relate it to genome function. 3D chromatin conformation has traditionally been studied by cytogenic methods, such as florescent in situ hybridization (FISH) [5]. Recently, several experimental technologies have been developed to capture chromatin conformations at multiple scales. For instance, the chromosome conformation capture (3C) technique has been used to study chromatin structure in living cells [6]. It derives the circularized chromosome conformation capture (4C) [7], which is able to detect many genomic loci interacting with a DNA region of interest. It is further extended to carbon copy chromosome conformation capture (5C), which allows for largescale detection of 3D chromatin interactions [8]. Further, HiC was introduced to dissect 3D chromatin structure at the genomescale [9]. Another technology, chromatin interaction analysis by pairedend tag sequencing (ChIAPET), detects genomewide chromatin interactions mediated by a protein of interest [10]. These technologies have generated large amounts and diverse types of data. To interpret these data appropriately and advance biological understanding, it is crucial to develop statistically sound approaches to their modeling and analysis.
Here, we focus on HiC for a genomescale analysis of chromatin conformations. HiC data are usually summarized into a contact map, which reflects the physical proximity between pairs of genomic loci at the genome scale. In a HiC contact map, an offdiagonal entry represents the number of pairedend reads spanning two different loci. The complex steps of HiC experiments introduce various biases, such as restriction enzyme cutting, GC content and sequence uniqueness [11]. For instance, HiC employs different restriction enzymes, such as NcoI (recognizing CCATGG) and HindIII (recognizing AAGCTT), which results in different genomic cutting sites and, consequently, contact maps. Some existing efforts are normalizing HiC contact maps to reduce systematic biases buried in the HiC experiments, either parametrically [12] or nonparametrically [11].
One of the most important goals of a HiC data analysis is to reconstruct 3D chromatin structures of the genome. Elucidating the 3D chromatin structure of the genome is important as it improves the mechanistic understanding of various gene regulatory events that are orchestrated in the nucleus of living cells. Also, transforming contact maps to 3D chromatin structures can be regarded as a dimensionreduction (noise filtering) procedure, as the degrees of freedom reduce from O(N ^{2}) to O(3N), where N is the number of genomic loci. The improvement is substantial, especially at the genome scale, as N is typically very large when many loci are involved.
A HiC experiment requires millions of cells. Therefore, chromatin interactions captured by HiC reflect the consensus structural conformation of the whole population of cells. Some existing computational efforts infer the consensus 3D chromatin structure. Some are based on optimization of target functions with prespecified constraints [13], e.g., ChromSDE [14] (employing a semidefinite programming approach), ShRec3D [15] (combining shortestpath distance with multidimensional scaling) and others [16–19]. However, these optimizationbased models may be trapped in local optima, particularly at low signal coverage (the percentage of nonzero entries in a contact map), and do not consider HiC experimental uncertainties. Statistical approaches have been developed to model the uncertainties in HiC experiments explicitly. For instance, MCMC5C [20] models HiC data through a Gaussian model. In this model, there are no bias removal steps, and the Gaussian variance estimate is ad hoc. To overcome these limitations, BACH [21] and PASTIS [22] employ Poisson models combining bias removal with 3D structure reconstruction. Due to limited availability of data, the reliability of these models remains to be tested when reconstructing 3D chromatin structure at the genome scale (for a more comprehensive review, see [23]).
Importantly, all these existing approaches for 3D chromatin structure reconstruction are designed for singletrack HiC data from only one restriction enzyme. It is likely that one can obtain improved 3D models through integrative modeling of multitrack HiC data combining different restriction enzymes. Moreover, few existing methods consider the local dependence of neighboring loci, thus they are sensitive to the sparsity of HiC contact maps. In addition, none of the existing methods has been assessed on a wide range of independent experimental data. Finally, no approaches have been shown to give consistent performance at the genome scale across various cell types. In this paper, we propose a novel approach named HSA, to reconstruct 3D chromatin structures at the genome scale by leveraging multitrack HiC data and modeling the local dependence of neighboring loci explicitly. To our knowledge, this is the first approach integrating multitrack HiC data for 3D chromatin structure reconstruction at the genome scale. We assess HSA extensively through simulations and real applications on HiC data from four cell lines. We also apply HSA to a recent in situ HiC study of eight cell lines. We use orthogonal data sets from FISH and ChIAPET experiments available for the cell lines as independent validations of the reconstructed 3D chromatin structures. The assessments demonstrate improved performance of HSA over a number of existing approaches across different cell lines at the genome scale. The study provides insights on the conservation of 3D chromatin structure across various human cell types.
Results and discussion
Method overview
An overview of HSA is illustrated in Fig. 1. HSA takes one or more HiC contact maps of the same resolution as input to reconstruct a consensus 3D chromatin structure. It utilizes the generalized linear model (GLM) with an iterative algorithm, which combines Hamiltonian dynamics with simulated annealing (SA), a global search strategy to explore the model space. It provides an option of Markov modeling when the contact maps have low signal coverage. The input for HSA can be either raw contact maps with count data or normalized contact maps obtained through existing approaches, such as Yaffe et al. [11]. The details of the HSA method are described in Section “Materials and methods”.
Assessments and comparisons on simulated data
In the simulation, we compared HSA with seven published methods: BACH [21], ChromSDE [14], ShRec3D [15], MCMC5C [20], AutoChrom3D [24], PASITS [22] and TADbit [25]. We used contact maps simulated from a regular helical structure (see Section “Materials and methods” for detailed derivations). We applied the methods to the simulated contact maps (see Additional file 1 for detailed implementation of each method). To test the effect of sparsity in the contact maps on the accuracy of the methods, we simulated contact maps at three signal coverage levels: 90 %, 70 % and 25 %. The true structures and the fitted structures from the eight methods at the three signal coverage levels are shown in Additional file 1: Figure S1. We calculated the Pearson correlation coefficients (PCCs) and the rootmeansquare deviations (RMSDs) between the true structures and the fitted structures for the eight methods (Fig. 2). Specifically, PCC measures the correlation between the real structure and predicted structure across the pairwise spatial distances among all loci. RMSD is calculated as the minimum root mean of squared distances between the 3D coordinates of each loci in the real and predicted structures (see Additional file 1 for detailed derivations). Clearly, HSA outperforms the others with the lowest RMSDs and the highest PCCs at all three signal coverage levels. To demonstrate the advantage of multitrack fitting uniquely implemented in HSA, we used HSA for joint modeling of the contact maps at 70 % and 25 % signal coverages. This multitrack fitting outperforms all its singletrack counterparts and has even better performance than some methods at 90 % signal coverage. This suggests that combining information from multiple contact maps may improve the accuracy of 3D chromatin structure reconstruction. Another unique feature of HSA is the option of Markov modeling. To investigate the utility of this feature, we applied BACH, HSA without Markov modeling, and HSA with Markov modeling on a contact map with 10 % signal coverage. HSA with Markov modeling clearly outperforms the other two with the lowest RMSD and the highest PCC (Additional file 1: Figure S2). We also assessed the eight methods using contact maps simulated from a randomwalk structure at 30 % signal coverage levels (see Section “Materials and methods” for detailed derivations). Again, HSA outperforms all the others with the lowest RMSDs and the highest PCCs (Table 1 and Additional file 1: Figure S3). We found that HSA with Markov modeling becomes increasingly important as the signal coverage goes down from 30 % to 10 % (Additional file 1: Table S1 and Additional file 1: Figure S4). Based on these simulation results, we suggest Markov modeling in HSA when the signal coverages in contact maps are less than 10 %.
The above simulations are based on the consensus structure assumption, i.e., there is only one true structure underlying a contact map. Among the tested methods, BACH, MCMC5C and TADbit are based on the assumption that there is an ensemble of structures underlying a contact map. To compare methods when the true structures are not unique, we employed the toy models used in developing the TADbit method [26]. In each toy model, a contact map was simulated from multiple structures with a certain noise level and structural variation. We applied BACH, MCMC5C and HSA to the contact maps of the toy models. We also extracted the structure predicted by TADbit based on the lowest integrative modeling platform objective function model for each contact map [26]. We then compared the predicted structure of each method to all the underlying structures of each toy model. We calculated the PCCs and RMSDs of each method at each combination of noise levels and structural variations in the toy models. As seen in Figs. 3 and 4, although HSA is a consensusstructurebased model, its performance is comparable to TADbit and better than BACH and MCMC5C on the toy models based on the ensemble structure assumption.
Application to HiC data of four cell lines
We applied HSA to the HiC data of four cell lines: mESC [27], GM06990 [9], K562 [9] and MCF7 [28]. mESC and GM06990 have HiC contact maps of both NcoI and HindIII, while K562 and MCF7 have those of HindIII only. To demonstrate the advantage of integrating information from multitrack HiC contact maps, we fitted HSA for both multitrack and singletrack. For comparison, we also applied BACH [21], ChromSDE [14] and ShRec3D [15] to the same data sets. We fitted HSA and other models using the same inputs. Specifically, we fitted HSA and BACH using the raw contact maps with enzyme cut fragment length, GC content and mappability as covariates for bias correction. We also fitted HSA, ChromSDE and ShRec3D using the normalized contact maps processed by the same pipeline [11], since the latter two do not have an internal bias correction process.
Multitrack and singletrack fittings of HSA result in consistent 3D structures, as shown in Fig. 5. At 200kilobase (kb) resolution, the 3D structures of the entire chromosome 14 of GM06990 show relatively smaller differences when fitted by HSA on NcoI and HindIII contact maps jointly, NcoI contact map only, and HindIII contact map only (Fig. 5 a–c). The lowest value of pairwise PCCs between the three structures is 0.76. The 3D structures fitted by BACH on NcoI and HindIII contact maps exhibit a larger difference (PCC =0.53) with some notable outlier loci (Fig. 5 d, e). ShRec3Dderived 3D structures from NcoI and HindIII contact maps have the largest difference (Fig. 5 f, g, PCC =0.37). ChromSDE was computationally overburdened on the contact maps at 200kb resolution when tested on our computer cluster (Bright Cluster Manager v5.2, CentOS 6.0, 128 GB of RAM per system board).
We then reconstructed the 3D structure of each chromosome at 1Mb resolution to compare the four methods across all four cell lines. To measure how well the fitted 3D structures explain the input contact maps, we transformed the pairwise distances in 3D into a fitted contact map by the powerlaw relationship \(F_{ij} \sim {d_{ij}^{\alpha }}\), where F _{ ij } is the (i,j) entry of the fitted contact map, d _{ ij } is the 3D distance and α is the trackspecific powerlaw coefficient. Specifically, we estimated α as β _{ c1} by the GLM framework of HSA. For BACH and ChromSDE, α was estimated by their respective models. For ShRec3D, we estimated α by fitting a GLM between the normalized contact maps and ln(d _{ ij }) with the Poisson link function. Then, we calculated the PCCs between the input and fitted contact maps for all chromosomes.
As shown in Fig. 6, HSAderived 3D structures fit the input HiC data better than the other three methods across all four cell lines in most cases. Notably, HSA fits equally well on both raw and normalized contact maps, while all the other three methods only work on one input data type. For normalized contact maps, HSA fits are clearly better than ChromSDE and ShRec3D (Fig. 6 b, d, f and h), which is likely due to the superiority of the global search strategy of SA employed in HSA. The increase of goodnessoffit is the largest for MCF7 (Fig. 6 h), in which the contact maps are very sparse (5–13 % signal coverages across chromosomes).
Moreover, multitrack and singletrack fittings of HiC data by HSA result in consistent goodnessoffit, which indicates the robustness of HSA in identifying the same underlying 3D structure probed by different restriction enzymes. We found that the 3D structures derived by multitrack HSA fitting explain the contact maps of NcoI and HindIII equally well. This lies in the ability of HSA to calculate trackspecific powerlaw coefficients for distance transformation when fitting multitrack HiC contact maps. As shown in Fig. 7, HSA derives different powerlaw coefficients for different contact maps, in which NcoI has a smaller powerlaw coefficient than HindIII does across all chromosomes in GM06990. This indicates that simply pooling different contact maps together is suboptimal, as the discrepancy in powerlaw coefficients breaks the additivity. Also note that the powerlaw coefficient has a high variability among chromosomes, which suggests that it might be inappropriate to assume a universal powerlaw coefficient for 3D reconstruction across all chromosomes.
The high correlations between predicted structures and input contact maps indicate our model can explain well the HiC data. But it is not a measure of the method accuracy per se. In the following sections, we sought to use orthogonal data, such as those from FISH and ChIAPET, to validate our predictions with HiC data.
Validations and comparisons using FISH data
We validated the 3D chromatin structures reconstructed from HiC contact maps with independent FISH data available for the cell lines mESC and GM06990. In mESC, FISH probes span a 32Mb region on chromosome 2 and a 65Mb region in chromosome 11 at 40kb resolution [29]. We applied all four methods except for ChromSDE (which ran out of memory) on HiC contact maps at 40kb resolution. HSA was fitted using the raw contact maps of NcoI and HindIII jointly, those of NcoI only, and those of HindIII only. BACH was fitted using the raw contact maps of NcoI only and those of HindIII only. ShRec3D was fitted using the normalized contact maps of HindIII. We then calculated the PCCs between the predicted distances based on the 3D structures and the corresponding FISHmeasured distances between the probed loci pairs. Each FISH locus overlaps with two binned loci in the HiC contact maps. So we tried different combinations (e.g., left–left, left–right, etc.) of the two bins at both ends of the FISHprobed loci pair when calculating the predicted distances and obtained a range of PCCs for each FISH data set. The PCCs of multitrack fitting of HSA are most robust (Fig. 8 a) and significantly higher than those of singletrack HSA on NcoI, BACH on NcoI, and ShRec3D on HindIII (p<0.02 under a righttailed Ttest, Additional file 1: Table S2). This was marginally significant when comparing the PCCs of multitrack HSA with those of singletrack HSA on HindIII (p=0.0619) or BACH on HindIII (p=0.0853), in which the former is mainly due to an outlier while the latter has evidently larger variance (Fig. 8 a). In GM06990, FISH probes span chromosomes 14 and 22 at 200kb resolution [9]. We were able to reconstruct the 3D structures of the entire chromosomes using HSA, BACH and ShRec3D using 200kb resolution HiC maps, while ChromSDE ran out of memory. Again, HSA is more robust and accurate compared to BACH and ShRec3D, in which the PCCs of multitrack HSA are significantly higher than those of the other six approaches (p≤0.0003 under a righttailed Ttest, Additional file 1: Table S3).
Validations and comparisons using ChIAPET data
We further validated the 3D chromatin structures using publicly available ChIAPET data of RNA PolII in mESC [30], K562 [31] and MCF7 [31]. These ChIAPET data provide genomewide chromatin interactions mediated by RNA PolII. We reasoned that the 3D distances between loci pairs with ChIAPET interactions (loops) are smaller than those of noninteracting pairs (nonloops) among the RNA PolII anchors. So we extracted all genomic loci involved in interactions detected by ChIAPET for each cell line, and divided all possible loci pairs into two groups depending on whether they were involved in ChIAPET detected interactions. The predicted spatial distance between each loci pair was calculated based on the reconstructed 3D chromatin structures. Indeed, we found dramatic difference between loops and nonloops in HSAderived 3D structures (Fig. 9) in all cell lines tested. RNA PolII mediated loops are significantly closer to each other than nonloops are in 3D (p=0). Moreover, the difference between the two groups is relatively larger in the 3D chromatin structures reconstructed by HSA, compared to those of BACH, ChromSDE and ShRec3D. The increased performance is especially remarkable on sparse HiC contact maps in MCF7. This indicates that HSA is more precise in reconstructing 3D chromatin structures at the genome scale, as validated by RNA PolII ChIAPET data.
Application to in situ HiC data of eight cell lines
We further applied HSA to the contact maps of seven human cell lines (GM12878, HMEC, HUVEC, IMR90, K562, KBM7 and NHEK) and one mouse cell line (CH12LX) from a recent in situ HiC study [32]. We fitted HSA based on the contact maps of 1Mb, 100kb and 25kb resolution. All fitted contact maps correlate well with the input contact maps at all three resolutions (Additional file 1: Figure S5). To investigate the similarities of the 3D chromatin conformations of different cell types, we overlaid the fitted structures of the seven human cell lines at 1Mb (Fig. 10), 100kb (Additional file 1: Figure S6) and 25kb (Additional file 1: Figure S7) resolution. Strikingly, at all three resolutions, these diverse sets of human cell types display similar global conformations. We further investigated the consistency of the local regions of the fitted structures across the seven human cell lines. Specifically, for each genomic locus and its neighboring 20 loci, we calculated the PCCs and Spearman correlation coefficients between any pair of the local structures of the seven human cell lines within that neighborhood region. We found that over 70 % of genomic loci at 25kb resolution have PCCs or Spearman correlation coefficients ≥0.7 across all pairwise comparisons of the seven human cell lines, and the percentage goes to more than 90 % at 100kb resolution (Additional file 1: Figure S8). This suggests that the genome conformations of diverse cell types are conserved, as revealed by the fitted 3D chromatin structures.
Conclusions
We have developed HSA  a novel method for improved chromatin structure reconstruction at the genome scale. Its joint modeling framework has the advantage of combining information from multitrack HiC contact maps of different restriction enzymes. The underlying chromatin structure is characterized by a GLM with Markov modeling. HSA searches the model space through an iterative algorithm combining SA with Hamiltonian dynamics, allowing efficient global model exploration.
The proposed method can handle diverse types of inputs of HiC data, including both normalized and unnormalized contact maps. It is especially effective for sparse contact maps, which are very common for HiC data. It models the local dependence of neighboring loci explicitly by Markov chains. The algorithm showed substantial improvement when the HiC contact map is sparse (say, 10 % signal coverage).
We tested our method through extensive simulations with known underlying structures. We found that our method is more accurate and robust than or comparable to existing methods at various signal coverage levels. We demonstrated that the performance on sparse contact maps is significantly improved by multitrack fitting and Markov modeling.
We applied the proposed HSA method to HiC data sets of diverse cell types from humans and mice. We found that our model fits the data better than a number of existing methods. We also employed orthogonal FISH and ChIAPET data as independent validation of our reconstructed 3D structures. We demonstrated that our method outperforms a number of existing approaches across various cell lines. Importantly, the application of HSA to in situ HiC data reveals striking consistency across different human cell types, which suggests there are certain invariant 3D conformations of the genome, despite the dynamic temporal and spatial variations. This finding complements the wellknown conservation of the topologically associated domains of the genome [27]. Our study points to two potential directions for further exploration. First, it will be interesting to study the motion of the chromatin as a polymer to understand why and how it generates different 3D conformations across cell types while maintaining a certain invariant topology. Second, our multitrack modeling can be extended to analyze different cell types jointly to extract the principal rules underlying 3D genome folding.
The application of HSA to all chromosomes of a dozen human and mouse cell lines demonstrates the feasibility of genomescale 3D chromatin structure reconstruction. The running time of HSA remains reasonable up to 25kb resolution for in situ HiC data (∼2000 loci per chromosome). In general, the running time of HSA increases by an order of O(N ^{2}) with N as the number of loci, and by an order of O(C) with C as the number of tracks (Additional file 1: Table S4).
Chromatin conformation is known to play essential roles in genome function. Highthroughput technologies such as HiC are generating genomescale data sets for dissecting chromatin conformations in various biological conditions. The demonstrated ability of our method applied to diverse organisms and cellular conditions will deepen our understanding of 3D chromatin structure as the basis for regulating cellular functions.
Materials and methods
The HSA algorithm
Given a specific region in a genome of interest, we assume there are C contact maps (C≥1) treated by C restriction enzymes. The available loci in the cth track of the maps are defined as those containing the corresponding 6mer sequence recognized by the cth restriction enzyme with mappability score over a certain cutoff [11].
If the inputs are the raw contact maps containing the counts of pairedend reads for any two loci, we model the read counts between any two available loci of the cth track using a GLM as below:
where \(n_{i_{c}j_{c}}\) and \(d_{i_{c}j_{c}}\) represent the contact frequency and 3D distance between loci i _{ c } and j _{ c }, respectively. \(S_{i_{c}}\) indicates the 3D coordinate of the underlying locus i _{ c }. \(\beta _{c1}\ln (d_{i_{c}j_{c}})\) reflects the powerlaw relationship between contact frequency and 3D distance [8] where β _{ c1} is the powerlaw coefficient. \(x_{i_{c}j_{c}k}\) is the kth covariate for bias correction. Following Hu et al. [12], we include enzymecutting fragment length, GC content, and mappability as the covariates. The corresponding regression coefficients are denoted by β _{ c0}, β _{ c1} and β _{ ck }.
Since restriction enzymes have varied cutting sites across the genome, the proposed joint modeling of multiple HiC tracks is able to cover more genomic loci by considering the union of available loci from all tracks. For simplicity, we note the coordinates of the ith locus after the union as S _{ i } (\(i \in \cup _{c=1}^{C} G(c)\)). When counts are the random variables of interest, Poisson and negative binomial models are the two commonly used approaches. For singletrack HiC data, existing research has shown that Poisson regression and negative binomial regression have similar performance [12, 22]. Thus, for each track c, we employ a Poisson regression model to characterize the counts of sequencing reads for simplicity.
Genomic loci of local proximity have innate correlations as connected residues in a polymer. We characterize the adjacency relationship of neighboring loci by a Gaussian Markov chain hidden in the contact maps to capture the local dependence of genomic loci:
where A and b are the coefficients that characterize the transition of coordinates between loci i−1 and i. Σ is the covariance matrix. The parameters A and b can be chosen empirically to reflect the polymer’s helix tendency as a priori information [33]. For simplicity, we set A as an identity matrix I _{3}, b as a zero vector, and Σ=(1/λ)I _{3} (λ≥0). Denote the union of the genomic loci \(\cup _{c=1}^{C} G(c)\) as {l _{1},l _{2},…,l _{ n }} with l _{1}=1 and l _{ i }<l _{ i+1}, where n is the total number of loci. Given \(S_{l_{1}},S_{l_{2}}, \dots, S_{l_{n}}\), the loglikelihood of our model is:
where the first term is the conventional loglikelihood of the GLM under a Poisson link function, the second term is a constant, and the last term reflects a distance penalty. The distance penalty controls the smoothness of the coordinates of neighboring loci with a tuning parameter λ. At the extreme scenario λ=0, it corresponds to a GLM entirely relying on the contact maps without smoothing. Smoothing is necessary when the contact map is sparse. We set \(\lambda =O(\sqrt {n})\) during parameter initialization and λ=1 in iterations when the density of the contact map is under 10 %, and λ=0 in other cases.
When the input data are normalized contact maps obtained through a certain bias correction approach such as [11], we replace \(n_{i_{c}j_{c}c}\) by the corresponding normalized intensity in the loglikelihood function without bias correction terms.
Parameters in HSA are estimated through an iterative algorithm. We first fit the GLM without the distance powerlaw term (\(\beta _{c1}\ln (d_{i_{c}j_{c}})\)) to initialize β _{ ck }, k≠1. Then, we sequentially optimize the coordinate \(S_{l_{i}}\) based on the locations of its previous min(5,i) loci under the loglikelihood with Markov property to get an initial structure. We then use SA combined with Hamiltonian dynamics to explore the model space under the GLM, and update all coefficients iteratively. SA is a probabilistic method for locating a good approximation to the global optimum in a highdimensional search space. It has been a popular tool for molecular structure prediction [34, 35]. The use of SA with Hamiltonian dynamics allows the efficient global exploration of the model space.
HSA is opensource software available from http://ouyanglab.jax.org/hsa/. The source code and user manual of HSA are provided under the GNU General Public License (GPL) at http://dx.doi.org/10.5281/zenodo.45514.
Simulated contact maps
We simulated contact maps based on the regular helical structure and the randomwalk structure. We generated the (i,j) entry of a contact map as a Poissondistributed random number n _{ ij }. The parameter λ _{ ij } of the Poisson distribution is based on the powerlaw conversion of the distance matrices of the real structures \(\lambda _{ij} = c/d_{ij}^{\alpha }\). We set α=1.5 and tuned c to make the signal coverage (the percentage of nonzero entries in a contact map) at 90 %, 70 % and 25 %. According to [14], we simulated uniformly distributed random numbers in (0,1) for the covariates, including enzymecutting fragment length, GC content and mappability, and used them as input for BACH and HSA. For the comparison at 10 % signal coverage, we simulated the λ _{ ij } as
with x _{ ijk }=x _{ ik } x _{ jk }, where x _{ ik } (k=1,2,3) were uniformly distributed in (0,1). We simulated the contact maps at 90 %, 70 % and 25 % signal coverage levels from the regular helical structure specified as:
We also simulated the contact maps at 10 % signal coverage levels from the regular helical structure specified as:
Finally, we simulated the contact maps for the randomwalk structure as Poissondistributed random numbers. The Poisson distribution parameter
and x _{ ijk }=x _{ ik } x _{ jk }, where x _{ ik } (k=1,2,3) were uniformly distributed in (0,1). All the simulated contact maps and structures are available at http://ouyanglab.jax.org/hsa/.
The toy models
The toy models are the very six toy chromosomes constructed by Trussart et al. [26]. The contact maps and the underlying structures were downloaded from http://sgt.cnag.cat/3dg/datasets/. The 3D structures reconstructed by TADbit were obtained directly from the aforementioned website. We applied HSA, BACH and MCMC5C to all the 168 contact maps to obtain their respective reconstructed structures. Denotations of the noise and structural variation levels of the contact maps were kept the same as in the aforementioned website. Specifically, alpha denotes the simulated experimental noise level, whose value is related to the decay of the Gaussian function [26] between the probability of interactions and the 3D Euclidean distances. A set represents the structural variation level. The nth set was generated by extracting 100 conformations separated by a time step of 10^{n} iterations in the simulation [26].
HiC data
The HiC data used in our study are from four cell lines: mESC, GM06990, K562 and MCF7. A description of each of these follows:
mESC: The mapped reads are accessible at Gene Expression Omnibus (GEO) under the accession number GSE35156. Raw and normalized contact maps at 40kb resolution [27] were downloaded from [36]. Raw and normalized contact maps at 1Mb resolution were obtained from [14]. The normalized contact maps were all processed by the approach of [11].
GM06990: The mapped reads are accessible at GEO under the accession number GSE18199. The normalized contact maps at 1Mb resolution [9] were downloaded from [37]. We used the pipeline of [11] to obtain normalized contact maps at 200kb resolution.
K562: The mapped reads are accessible at GEO under the accession number GSE18199. We used the pipeline [11] for normalization.
MCF7: The raw data are from [28].
We used the pipeline [11] for normalization.
All raw contact maps were modeled with covariates including enzyme cut fragment length, GC content and mappability calculated according to [12].
FISH data
We obtained the published FISH data sets in mESC [29] from [21]. FISH data in GM06990 [9] were downloaded from GEO under the accession number GSE18199. The average interlocus distances were used as the reference distance between loci. Different structures were scaled as done in [21]. Specifically, suppose we have p structures and M FISH measured distances:
We performed this linear regression without an intercept and used the estimated β _{ k } to scale the kth structure. The FISH distances and predicted structures and contact maps are available at http://ouyanglab.jax.org/hsa/ and http://dx.doi.org/10.5281/zenodo.45513.
ChIAPET data
We obtained the published RNA PolII ChIAPET data sets in mESC from [30], in K562 from [31] and in MCF7 from [31]. RNA PolII mediated loops within 10 Mb in genomic distance were used as the benchmark. In the ChIAPET data sets of MCF7 and K562, the original RNA PolII anchors were annotated according to the hg19 reference genome. To make them compatible with the HiC contact maps, we used the liftover program from the UCSC Genome Browser to obtain the annotations according to the hg18 reference genome. To make the 3D structures reconstructed by different methods comparable, we scaled all pairwise distances among RNA PolII anchors by the maximum distance in each reconstructed 3D structure.
In situ HiC data
In situ HiC data were downloaded from GEO with accession number GSE63525 [32]. We followed the KRnorm way specified in GSE63525_OVERALL_README.rtf in the above GEO site to get intrachromosomal normalized contact maps at 1Mb, 100kb and 25kb resolution.
Ethical approval
No ethical approval was required for this study.
Abbreviations
 3D:

threedimensional
 ChIAPET:

chromatin interaction analysis by pairedend tag sequencing
 FISH:

florescent in situ hybridization
 GEO:

Gene Expression Omnibus
 GLM:

generalized linear model
 HSA/BACH/ChromSDEh:

HSA/BACH/ChromSDE using the contact map of HindIII as the input
 HSA/BACH/ChromSDEn:

HSA/BACH/ChromSDE using the contact map of NcoI as the input
 HSA/BACHh1/h2:

HSA/BACH using the contact map of HindIII replicate 1/replicate 2 as the input
 HSAmulti:

HSA using multiple contact maps as the input
 HSAmultinorm/singlenorm:

HSA using multiple normalized contact maps/a single normalized contact map as the input
 HSAnorm:

HSA using normalized contact maps as the input
 HSAsingle:

HSA using a single contact map as the input
 kb:

kilobase
 PCC:

Pearson correlation coefficient
 RMSD:

rootmeansquare deviation
 SA:

simulated annealing
References
Miele A, Dekker J. Longrange chromosomal interactions and gene regulation. Mol Biosyst. 2008; 4:1046–57.
Dekker J. Gene regulation in the third dimension. Science. 2008; 319:1793–4.
De S, Michor F. DNA replication timing and longrange DNA interactions predict mutational landscapes of cancer genomes. Nat Biotechnol. 2011; 29:1103–8.
Ong CT, Corces VG. Ctcf: an architectural protein bridging genome topology and function. Nat Rev Genet. 2014; 15(4):234–46.
Van Steensel B, Dekker J. Genomics tools for unraveling chromosome architecture. Nat Biotechnol. 2010; 28(10):1089–95.
Dekker J, Rippe K, Dekker M, Kleckner N. Capturing chromosome conformation. Science. 2002; 295(5558):1306–11.
Dekker J. The three ‘C’s of chromosome conformation capture: controls, controls, controls. Nat Methods. 2006; 3(1):17–21.
Dekker J, MartiRenom MA, Mirny LA. Exploring the threedimensional organization of genomes: interpreting chromatin interaction data. Nat Rev Genet. 2013; 14:390–403.
Aiden EL, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, et al.Comprehensive mapping of longrange interactions reveals folding principles of the human genome. Science. 2009; 326:289–93.
Fullwood M, Liu M, Pan Y, Liu J, Xu H, Mohamed Y, et al.An oestrogenreceptor αbound human chromatin interactome. Nature. 2009; 462(7269):58–64.
Yaffe E, Tanay A. Probabilistic modeling of HiC contact maps eliminates systematic biases to characterize global chromosomal architecture. Nat Genet. 2011; 43:1059–65.
Hu M, Deng K, Selvaraj S, Qin Z, Ren B, Liu JS. HiCNorm: removing biases in HiC data via Poisson regression. Bioinformatics. 2012; 28:3131–3.
Serra F, Di Stefano M, Spill YG, Cuartero Y, Goodstadt M, Baù D, et al.Restraintbased threedimensional modeling of genomes and genomic domains. FEBS Let. 2015; 589(20):2987–95.
Zhang ZZ, Li GL, Toh KC, Sung WK. Inference of spatial organizations of chromosomes using semidefinite embedding approach and HiC data. Res Comput Mol Biol. 2013; 7821:317–32.
Lesne A, Riposo J, Roger P, Cournac A, Mozziconacci J. 3D genome reconstruction from chromosomal contacts. Nat Methods. 2014; 11:1141.
Van Berkum NL, LiebermanAiden E, Williams L, Imakaev M, Gnirke A, Mirny LA, et al.HiC: a method to study the threedimensional architecture of genomes. J Vis Exp. 2010;(39):1869.
Duan Z, Andronescu M, Schutz K, McIlwain S, Kim YJ, Lee C, et al.A threedimensional model of the yeast genome. Nature. 2010; 465(7296):363–7.
Tanizawa H, Iwasaki O, Tanaka A, Capizzi JR, Wickramasinghe P, Lee M, et al.Mapping of longrange associations throughout the fission yeast genome reveals global genome organization linked to transcriptional regulation. Nucleic Acids Res. 2010; 38(22):8164–77.
Trieu T, Cheng J. Largescale reconstruction of 3D structures of human chromosomes from chromosomal contact data. Nucleic Acids Res. 2014; 42:52.
Rousseau M, Fraser J, Ferraiuolo M, Dostie J, Blanchette M. Threedimensional modeling of chromatin structure from interaction frequency data using Markov chain Monte Carlo sampling. BMC Bioinform. 2011; 12:414.
Hu M, Deng K, Qin Z, Dixon J, Selvaraj S, Fang J, et al.Bayesian inference of spatial organizations of chromosomes. PLoS Comput Biol. 2013; 9:1002893.
Varoquaux N, Ay F, Noble WS, Vert JP. A statistical approach for inferring the 3D structure of the genome. Bioinformatics. 2014; 30:26–33.
Hu M, Deng K, Qin Z, Liu JS. Understanding spatial organizations of chromosomes via statistical analysis of HiC data. Quant Biol. 2013; 1:156–74.
Peng C, Fu LY, Dong PF, Deng ZL, Li JX, Wang XT, et al.The sequencing bias relaxed characteristics of HiC derived data and implications for chromatin 3D modeling. Nucleic Acids Res. 2013; 41(19):183–3.
Serra F, Baù D, Filion G, MartiRenom MA. Structural features of the fly chromatin colors revealed by automatic threedimensional modeling. BioRxiv. 2016:1–29. doi:10.1101/036764.
Trussart M, Serra F, Baù D, Junier I, Serrano L, MartiRenom MA. Assessing the limits of restraintbased 3D modeling of genomes and genomic domains. Nucleic Acids Res. 2015; 43(7):3465–77.
Dixon JR, Selvaraj S, Yue F, Kim A, Li Y, Shen Y, et al.Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature. 2012; 485:376–80.
Wang J, Lan X, Hsu PY, Hsu HK, Huang K, Parvin J, et al.Genomewide analysis uncovers high frequency, strong differential chromosomal interactions and their associated epigenetic patterns in E2mediated gene regulation. BMC Genom. 2013; 14:70.
Eskeland R, Leeb M, Grimes GR, Kress C, Boyle S, Sproul D, et al.Ring1B compacts chromatin structure and represses gene expression independent of histone ubiquitination. Mol Cell. 2010; 38:452–64.
Zhang Y, Wong CH, Birnbaum RY, Li G, Favaro R, Ngan CY, et al.Chromatin connectivity maps reveal dynamic promoterenhancer longrange associations. Nature. 2013; 504(7479):306–10.
Li G, Ruan X, Auerbach RK, Sandhu KS, Zheng M, Wang P, et al.Extensive promotercentered chromatin interactions provide a topological basis for transcription regulation. Cell. 2012; 148:84–98.
Rao SS, Huntley MH, Durand NC, Stamenova EK, Bochkov ID, Robinson JT, et al.A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. 2014; 159(7):1665–80.
Natta G. From stereospecific polymerization to asymmetric autocatalytic synthesis of macromolecules. Rubber Chem Technol. 1965; 38:37–60.
BassolinoKlimas D, Tejero R, Krystek SR, Metzler WJ, Montelione GT, Bruccoleri RE. Simulated annealing with restrained molecular dynamics using a flexible restraint potential: Theory and evaluation with simulated NMR constraints. Protein Sci. 1996; 5:593–603.
Stenzel O, Westhoff D, Manke I, Kasper M, Kroese DP, Schmidt V. Graphbased simulated annealing: A hybrid approach to stochastic modeling of complex microstructures. Model Simul Mater Sci Eng. 2013; 21:55004–21.
mESC HiC Data. http://chromosome.sdsc.edu/mouse/hic/download.html. Accessed date 24 Nov 2013.
GM, 06990 Normalized HiC Data. http://compgenomics.weizmann.ac.il/tanay/?page_id=283. Accessed date 14 Mar 2014.
Acknowledgments
We thank Edison Liu for inspiring discussions, and Charles Lee for helpful suggestions. We also thank Mei Xiao and Zhonghui Tang for technical assistance. The work is supported by a faculty startup fund from the Jackson Laboratory for Genomic Medicine (to ZO), the Research Starter Grant in Informatics from PhRMA Foundation (to ZO), and the CHIP Faculty Affiliate Seed Grant at UConn (to YZ). We are thankful for travel support from the Statistical and Applied Mathematical Sciences Institute (SAMSI) for attending workshops.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
ZO conceived the project. CZ and ZO designed and performed the research. CZ, YZ and ZO analyzed data. All authors wrote, read and approved the final manuscript.
Additional file
Additional file 1
Supplementary materials, including additional figures and tables not shown in the manuscript. (PDF 13 MB)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Zou, C., Zhang, Y. & Ouyang, Z. HSA: integrating multitrack HiC data for genomescale reconstruction of 3D chromatin structure. Genome Biol 17, 40 (2016). https://doi.org/10.1186/s1305901608961
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1305901608961
Keywords
 HiC
 3D chromatin structure
 Multitrack modeling
 Markov chain
 Simulated annealing