HiFive: a tool suite for easy and efficient HiC and 5C data analysis

The chromatin interaction assays 5C and HiC have advanced our understanding of genomic spatial organization, but analysis approaches for these data are limited by usability and flexibility. The HiFive tool suite provides efficient data handling and a variety of normalization approaches for easy, fast analysis and method comparison. Integration of MPI-based parallelization allows scalability and rapid processing time. In addition to single-command analysis of an entire experiment from mapped reads to interaction values, HiFive has been integrated into the open-source, web-based platform Galaxy to connect users with computational resources and a graphical interface. HiFive is open-source software available from http://taylorlab.org/software/hifive/. Electronic supplementary material: The online version of this article (doi:10.1186/s13059-015-0806-y) contains supplementary material, which is available to authorized users.

All data processing and normalization using HiFive were performed with version 1.1.3.

5C filtering
5C read data were imported directly from the BAM alignment files, and reads that either had a single mapped end or mapped to same-orientation probes were discarded. For each dataset, HiFive's iterative filtering was performed using a cutoff of 20 interactions per fragment and a minimum interaction size of 50 kilobases (Kb). The iterative filtering was accomplished as follows. For each valid (not removed from analysis) fragment, the number of non-zero interaction pairings longer than the size cutoff was found. Fragments with fewer interactions than the cutoff were removed. This was repeated until all fragments met the minimum interaction criterion.
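The iterative filtering loop described above can be sketched in a few lines. This is an illustrative, self-contained version, not HiFive's API; the data structures (a dict of non-zero pair counts and a dict of fragment midpoints) are assumptions made for the example.

```python
# Hedged sketch of HiFive-style iterative fragment filtering.
# interactions: dict mapping frozenset({i, j}) -> non-zero count
# positions: dict mapping fragment id -> midpoint coordinate
def iterative_filter(interactions, positions, min_count=20, min_size=50000):
    valid = set(positions)
    changed = True
    while changed:
        changed = False
        counts = {f: 0 for f in valid}
        for pair in interactions:
            i, j = tuple(pair)
            # only count pairings between still-valid fragments that are
            # longer than the minimum interaction size
            if i in valid and j in valid and abs(positions[i] - positions[j]) >= min_size:
                counts[i] += 1
                counts[j] += 1
        for f, c in counts.items():
            if c < min_count:
                valid.discard(f)  # drop the fragment and re-count next pass
                changed = True
    return valid
```

Because removing one fragment can drop another fragment below the cutoff, the count-and-remove pass repeats until no fragment changes.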

5C distance dependence function estimation
The distance-dependence estimation function for each 5C dataset was found using a power-law relationship [3], with parameter values derived from a linear regression between the log-transformed interaction sizes and the log-transformed observed reads for each valid (unfiltered) non-zero fragment pairing (Supplemental Fig. 5). For an interaction between forward fragment i and reverse fragment j in region n, the expected distance-dependent signal is defined in terms of the log-transformed interaction distance, the global and regional mean log-transformed interaction signals µ_global and µ_n, respectively, and the slope parameter γ from the linear regression. The value µ_global is the mean value of all valid log-transformed counts across the set of all regions N (2).
The value µ n is used to rescale correction parameters to have a mean correction of zero (3).
This parameter has a value of zero until normalization is performed using either the Express or Probability algorithms, at which time it is calculated and the fragment correction parameters are adjusted (4).
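The power-law fit described above reduces to a linear regression in log-log space. A minimal sketch, using only numpy (the function name is illustrative, not HiFive's implementation):

```python
# Fit the power-law distance dependence: regress log-transformed counts
# on log-transformed interaction distances over non-zero pairings.
import numpy as np

def fit_distance_dependence(distances, counts):
    """Return (gamma, intercept) of the log-log linear regression."""
    mask = counts > 0                 # only valid non-zero pairings
    log_d = np.log(distances[mask])
    log_c = np.log(counts[mask])
    gamma, intercept = np.polyfit(log_d, log_c, 1)
    return gamma, intercept
```

The slope gamma corresponds to the γ parameter used in the expected-signal calculation.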

5C normalization -Probability
5C data normalized using HiFive's Probability algorithm were modeled using a lognormal distribution. Each region's correction values were learned independently. Prior to learning, correction values were initialized as the square root of the mean difference of log-transformed non-zero interaction values and distance-dependence signals (5).
Correction values were found using gradient descent with a backtracking line search. Learning continued for a maximum of 1000 iterations and terminated early if the maximum correction parameter gradient fell below 5e-4. The expected value for each log-transformed interaction was calculated as the sum of its predicted distance-dependent signal, the correction value for the first fragment, and the correction value for the second fragment (6).
Because counts were discrete, the cost function C was not strictly calculated as originating from a lognormal distribution but instead was found as follows: where Φ is the cumulative distribution function for the standard normal distribution, φ is the probability density function for the standard normal distribution, and σ is the standard deviation prior to normalization (8).
I is an indicator function (9).
This indicator function is necessary to deal with the limited precision of floating-point values and the inability to resolve small differences in the standard normal CDF for large values of c. For each iteration t, the learning rate r was set to 0.01 and then decreased by 50% until either the Armijo value (10) fell below zero, indicating a sufficiently advantageous update of the correction parameters, or twenty updates had been performed on the learning rate.
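A generic sketch of the gradient descent with backtracking line search described above, with the stated constants (initial rate 0.01, gradient tolerance 5e-4, 1000 iterations, twenty rate halvings). The sufficient-decrease constant `c1` and the exact Armijo expression are standard-textbook assumptions, not HiFive's internals:

```python
# Gradient descent with an Armijo backtracking line search: halve the
# learning rate until the trial step gives a sufficient cost decrease.
import numpy as np

def backtracking_descent(cost, grad, x, rate=0.01, c1=1e-4,
                         max_iter=1000, tol=5e-4, max_halvings=20):
    for _ in range(max_iter):
        g = grad(x)
        if np.max(np.abs(g)) < tol:     # early termination on small gradient
            break
        r = rate
        for _ in range(max_halvings):
            # Armijo value: negative means a sufficiently advantageous step
            armijo = cost(x - r * g) - cost(x) + c1 * r * np.dot(g, g)
            if armijo < 0:
                break
            r *= 0.5                    # decrease the learning rate by 50%
        x = x - r * g
    return x
```

On a simple convex cost this converges to near the minimizer well within the iteration cap.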

5C normalization -Express
5C data normalized using HiFive's Express and ExpressKR algorithms were corrected using a matrix-balancing approach. Each region was handled independently (intra-regional interactions only). For ExpressKR learning, estimated distance-dependent signals were calculated for all non-zero interactions. For the standard Express algorithm, corrections were iteratively calculated as follows for fragment i with valid non-zero interactions A_i (11).
Learning was run for 1000 iterations. The ExpressKR algorithm used the matrix-balancing algorithm described by Knight and Ruiz [4]. All non-zero counts were log-transformed prior to learning and the estimated distance-dependent signal was subtracted from each. In order to achieve convergence, a pseudo-count of one was added to the matrix diagonal after log-transforming values. Learning was run until the residual fell below 1e-12.
Subsequent to learning correction values using either algorithm, values were adjusted so that the mean correction adjustment across all valid forward-reverse combinations for a region was zero (4).
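The matrix-balancing idea behind the Express update can be illustrated with a generic square-root damped iteration; the update form below is a common sketch of this family of methods, not HiFive's exact per-fragment formula (11):

```python
# Generic square-root matrix balancing: scale each fragment's correction
# by the square root of its mean corrected signal until row means agree.
import numpy as np

def express_balance(matrix, iterations=1000):
    """matrix: symmetric non-negative counts with a zero diagonal.
    Returns corrections c; dividing entry (i, j) by c[i] * c[j]
    equalizes the corrected row means."""
    n = matrix.shape[0]
    c = np.ones(n)
    for _ in range(iterations):
        corrected = matrix / np.outer(c, c)
        row_means = corrected.sum(axis=1) / (n - 1)
        c *= np.sqrt(row_means)   # damped update toward balance
    return c
```

The square root damps each update, which keeps the iteration stable on sparse count matrices.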

5C normalization -Binning
5C data normalized using HiFive's Binning algorithm were corrected using an adaptation of Yaffe and Tanay's approach [5]. Each region was handled independently (intra-regional interactions only). The model used two features, fragment length and GC content. GC content was calculated as the percentage of guanine and cytosine in sequence-specific probe and spacer sequences of each primer. Both model features were partitioned into five intervals such that each interval contained an equal number of fragments regardless of probe orientation.
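The equal-occupancy partitioning described above (each interval holding the same number of fragments) can be sketched as a quantile-style split of the sorted fragment order; the function name is illustrative:

```python
# Partition a feature (e.g. fragment length or GC content) into n_bins
# intervals containing (as nearly as possible) equal numbers of fragments.
import numpy as np

def equal_occupancy_bins(values, n_bins=5):
    order = np.argsort(values, kind="stable")
    bins = np.empty(len(values), dtype=int)
    # split the sorted fragment order into n_bins nearly equal chunks
    for b, chunk in enumerate(np.array_split(order, n_bins)):
        bins[chunk] = b
    return bins
```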
Log-transformed reads were modeled as arising from a normal distribution, with the mean for each interaction equal to the sum of the estimated distance-dependent signal and the bin corrections P corresponding to the fragment pair, where the forward and reverse fragments fall in feature intervals a and b, respectively, for each feature k across the total set of features K (12).
Seed values for feature corrections were calculated as the mean of the log-transformed reads minus distance-dependence predictions for each feature bin, divided by the number of observations in the bin (13).
Learning was accomplished iteratively, with each feature's correction values optimized independently at each iteration. Learning continued for a maximum of 1000 iterations and was stopped early if the log-likelihood changed by less than 1.0 for a given iteration. Optimization of correction values was done using the Broyden-Fletcher-Goldfarb-Shanno algorithm.

HiC filtering
Reads from aligned HiC data were loaded directly from BAM files, discarding reads that had a single mapped end. Reads were then assigned to fragment ends (fends) based on mapping within a fragment's boundaries and the orientation of alignment. Reads that mapped outside the first or last restriction enzyme (RE) cut site were discarded. Read pairs whose insert size (the sum of the distances between each alignment coordinate and its respective downstream RE cut site) was less than 500 bp were discarded. For cases with multiple reads mapping to the same pair of coordinates, only one read was kept. In addition, reads that originated from the same fragment, or that originated from adjacent fragments and had opposite orientations, were also discarded (Supplemental Fig. 4).
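Two of the read-level filters above (duplicate removal and the insert-size cutoff) can be sketched as follows. The data shapes (coordinate tuples and a distance-to-cut-site lookup) are assumptions for the example, not HiFive's data model:

```python
# Keep one read per coordinate pair and, following the text above, drop
# pairs whose summed distance to the downstream RE cut sites (insert
# size) is below min_insert.
def filter_read_pairs(pairs, dist_to_cut, min_insert=500):
    """pairs: list of ((chrom, pos), (chrom, pos)) alignment coordinates.
    dist_to_cut: dict mapping a coordinate to the distance from it to
    its downstream RE cut site."""
    seen = set()
    kept = []
    for end1, end2 in pairs:
        key = (end1, end2)
        if key in seen:
            continue  # duplicate coordinate pair: keep only one copy
        seen.add(key)
        if dist_to_cut[end1] + dist_to_cut[end2] >= min_insert:
            kept.append((end1, end2))
    return kept
```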
For each dataset, HiC fends were filtered using HiFive's iterative filtering using a cutoff of 10 interactions per fragment and an interaction size minimum of 500 Kb. For each valid (not removed from analysis) fend, the number of non-zero interactions greater than 500 Kb in size with other valid fends was counted.
Fends with fewer interactions than the cutoff were removed. This was repeated until all fends met the minimum interaction criterion. The one exception was the human GM12878 MboI dataset normalized with HiFive's Probability algorithm, for which fend filtering used an upper limit of 10 megabases (Mb) on interaction sizes in addition to the lower limit. This was done to ensure all fends would have sufficient numbers of interactions for normalization, since the same size limit was also employed there.

HiC distance dependence function estimation
HiC distance-dependence estimation functions were calculated for each dataset using a piecewise linear approximation (Supplemental Fig. 6). Each genome's range was partitioned into 100 bins, with the smallest bin covering interactions ranging from 0 to 1000 bp. The 99 remaining bins covered the range from 1001 bp to the largest possible interaction size (last fend midpoint minus the first fend midpoint of chromosome 1). This range was partitioned such that the log-transformed distance interval each bin covered was equal, with upper and lower limits for bin n denoted by U_n and L_n, respectively. For every possible valid fend combination between fends i and j, defined as set A, the log-transformed distance d_ij and the observed count c_ij (set to one if greater than zero for the binary version of the function) were used to calculate each bin's mean signal. The estimated distance-dependence signal is calculated as falling on the line intersecting the nearest two bins as determined by interaction distance (14-16).
Correction values were adjusted by µ_m, the chromosome mean correction adjustment, which is used to give each intra-chromosomal Express and Probability algorithm correction pairing (excluding self-interactions) a mean value of 1 for chromosome m and correction parameter f_i (17). The Binning algorithm corrections do not require this adjustment because they are calculated globally, so µ_m is given a value of one.
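The piecewise-linear distance estimate above can be sketched as log-spaced binning followed by linear interpolation between the two nearest bin centers. This is a simplified, numpy-only illustration (uniform log-spaced bins over the observed range rather than HiFive's exact 100-bin layout):

```python
# Mean log-count per log-spaced distance bin, with the estimate for a
# query distance interpolated between the two nearest bin centers.
import numpy as np

def piecewise_distance_signal(distances, counts, query, n_bins=100):
    log_d = np.log(distances)
    edges = np.linspace(log_d.min(), log_d.max(), n_bins + 1)
    idx = np.clip(np.digitize(log_d, edges) - 1, 0, n_bins - 1)
    centers, means = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():               # skip empty bins
            centers.append(0.5 * (edges[b] + edges[b + 1]))
            means.append(np.log(counts[mask]).mean())
    # fall on the line through the two nearest occupied bin centers
    return np.interp(np.log(query), centers, means)
```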

HiC normalization -Probability
HiC data normalized using HiFive's Probability algorithm were modeled using a binomial distribution. The Poisson distribution is also available within HiFive as a distribution model but was not used due to poorer performance (see HiC probability model performance, below) (18).
Correction values were found using gradient descent with a backtracking line search. Learning continued for a maximum of 1000 iterations and terminated early if the maximum absolute correction parameter gradient fell below 5e-4. For each iteration t, the learning rate r was set to 1.0 and the Armijo value was calculated.
If the Armijo value was greater than zero, the learning rate was cut in half and the new cost and Armijo value were calculated.
After learning correction values for a chromosome, µ_m was calculated and the correction values were divided by the square root of this mean, centering them such that the mean correction equaled one (20).

HiC normalization -Express
HiC data normalized using HiFive's Express and ExpressKR algorithms were corrected using a matrix-balancing approach. Each chromosome was handled independently (intra-chromosomal interactions only) and only interactions with an interaction distance greater than 500 Kb were used for calculations. Prior to learning correction values, estimated distance-dependent signals were calculated for all non-zero interactions, except for the data marked "ExpressKR" in Figure 4 and Supplemental Figure 13, which were given an estimated signal of one. For the standard Express algorithm, corrections were iteratively updated (21) over 1000 iterations, terminating early if the maximum absolute adjustment value fell below 5e-6. The ExpressKR algorithm used the matrix-balancing algorithm described by Knight and Ruiz [4]. Because this approach requires a complete and symmetric matrix, values falling below the distance cutoff were given the value of zero. Learning was run until the residual fell below 1e-12. For the sample labeled "ExpressKR w/distance" in Figure 4 and Supplemental Figure 13, the counts were divided by the estimated distance-dependence signal prior to learning.
For both Express approaches, after learning correction values for a chromosome, µ_m was calculated and the correction values were divided by the square root of this mean (20).
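The centering step used by both Express approaches (and the Probability algorithm) can be sketched directly: dividing every correction by the square root of the mean pairwise product makes the mean pairwise correction equal one. A minimal sketch, assuming the corrections are held in a numpy vector:

```python
# Center corrections so the mean pairwise product c[i] * c[j]
# (excluding self-pairings) equals one.
import numpy as np

def center_corrections(c):
    n = len(c)
    pair_mean = (np.outer(c, c).sum() - np.sum(c * c)) / (n * (n - 1))
    return c / np.sqrt(pair_mean)
```

Scaling each correction by 1/sqrt(pair_mean) scales every pairwise product by 1/pair_mean, which is why a single square-root division suffices.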

HiC normalization -Binning
HiC data normalized using HiFive's Binning algorithm were corrected using a model in which fragment length and GC content were each partitioned into 20 non-overlapping intervals such that each interval contained an equal number of fends. Mappability was partitioned into 10 non-overlapping intervals spanning equal mappability ranges. The prior probability P_prior was calculated as the mean number of observations across all bins (22).
Seed values for each feature correction P for feature k and bins a and b were calculated as the mean number of observations in that bin combination divided by the prior probability (23).
Mappability corrections were not optimized after seed values were calculated.
Reads were modeled as arising from a binomial distribution, with the expected value for an observation defined as the product of the prior probability and each feature correction across the total set of features K, with bins a and b corresponding to the observation's fends (24).
Learning was accomplished iteratively with each feature's correction values being optimized independently for each iteration. Learning was carried out for a maximum of 1000 iterations and was stopped early if the log-likelihood changed by less than 1.0 for a given iteration. Optimization of correction values was performed using the Broyden-Fletcher-Goldfarb-Shanno algorithm.
For samples normalized for the pseudo-counts analysis, the specified number of counts was added to each bin, observed and possible, for each combination of ranges corresponding to that bin. For example, the bin for fend length interval one by fend length interval two had two times the pseudo-count added: one for interactions in which the first fend fell in interval one and the second fend fell in interval two, and the second pseudo-count for interactions in which the first fend fell in interval two and the second fend fell in interval one.
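The symmetric pseudo-count addition in the example above can be sketched as follows. The once-on-the-diagonal, twice-off-the-diagonal behavior is inferred from the worked example (a diagonal cell has only one ordering of its two intervals); the function name and matrix layout are illustrative:

```python
# Add `pseudo` once per ordering of the two fend intervals: diagonal
# bin-combination cells get it once, off-diagonal cells twice.
import numpy as np

def add_pseudocounts(bin_counts, pseudo):
    out = bin_counts.astype(float) + pseudo          # ordering (a, b)
    out += pseudo * (1 - np.eye(bin_counts.shape[0]))  # ordering (b, a), a != b
    return out
```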

HiCPipe HiC normalization
Normalization of HiC data using HiCPipe v0.9 was carried out as described in Yaffe and Tanay [5]. To match HiFive's range, only fends within the range of each chromosome's first and last RE cut site were included. Read data were exported from HiFive for each dataset, so all filtering based on read mapping and valid fend combinations was applied, but fend coverage filtering was not performed on these data for HiCPipe normalization. Fends were instead filtered by mappability, marking fends with less than 50% mappability as invalid.
Normalization was performed using a three-feature model: fragment length, GC content, and mappability. GC content and mappability were defined as described in 1.2.10 HiC normalization - Binning. The model used 20 bin ranges for fragment length and GC content, each partitioned to include equal numbers of valid interactions. Mappability was partitioned into five bin ranges, each spanning a mappability range of 10%, from 50% to 100%. Fragment length and GC content corrections were optimized while mappability corrections were held constant.

HiCNorm HiC normalization
HiC normalization was performed using HiCNorm as described by Hu et al. [6] with the following changes. Code was adapted to Python and implemented using the generalized linear model (GLM) function from the 'statsmodels' package instead of R. This was done for speed and memory considerations and was confirmed to give identical results. Data and features were as described in 1.3.1 HiCPipe HiC normalization.
For the speed and memory usage comparisons, HiCNorm normalization used the original HiCNorm code rather than our adapted code. The only exception was the binning of data, as there are no provided scripts for performing these operations. This step was done in R to avoid the need to load data from text files.

HiCLib HiC normalization
HiC data were normalized using the latest available version of HiCLib (obtained from the development repository on 04-20-15 [7]). Data were loaded directly from BAM mapped read files and filtered for PCR duplicates and a 500 bp insert size, and the lowest 0.5% of fends ranked by number of interactions were removed. Data were normalized for 20 iterations.

Matrix-balancing HiC normalization
HiC data were normalized using a matrix-balancing approach similar to that described by Rao et al. [8]. Data were loaded from HiFive, so all filtering described under 1.2.6 HiC filtering was applied prior to normalization.
Normalization was done on a per chromosome basis at fend level resolution using a python adaptation of the algorithm described by Knight and Ruiz [4]. Data were processed as binary observations rather than counts.

HiC probability model performance
In order to determine which probability model gave better results, we performed analysis on the mouse ESC datasets using a Poisson distribution as the underlying probability model in addition to the binomial model described above. Other than the cost and gradient equations, all other aspects of the analysis were identical to that described for the binomial probability normalization. Model performance was assessed using inter-dataset correlations (see Supplemental Methods: HiC dataset correlations).
Supplemental Figure 7 shows the dataset correlation differences between models, demonstrating an advantage of the binomial model across most interaction size ranges and bin sizes. The only cases where the Poisson distribution showed better correlation between datasets were mid-range interaction sizes for the 250 Kb and 1 Mb binned data and for the overall interchromosomal correlation for 1 Mb binned data. The gains in these cases were small compared to the improvements seen using the binomial model across all other cases. Of particularly strong contrast was the effect that model choice had on data binned in smaller bins.
These results were particularly surprising to us, as the binomial model requires binary data rather than integers. Because HiC data are integer counts of observed reads, the Poisson distribution seems a good fit, but the noise or substructure appear to confound finding appropriate normalization corrections. In addition, counts data appear to converge with the binary representation of the data at longer ranges (Supplemental Fig. 6).

5C-HiC data correlations
The Pearson correlation between 5C data and HiC data for each 5C dataset was obtained across all regions for all log-transformed, non-zero, fragment-corrected 5C counts. The 5C data were compared to HiC data from the same cell type that were normalized using HiCPipe and dynamically binned to matching coordinates.

HiC normalization runtime comparison
An abbreviated dataset consisting of only interactions for which one end mapped to chromosome 1 from the mouse NcoI dataset was created, along with a corresponding RE cut site bed file and fend feature file. Each stage of analysis for each method was run separately and timing was accomplished using the 'time' Linux command to obtain wall-clock runtimes. Each run was repeated five times and the median value was used for the analysis. All analyses were run on a single 2.294 GHz Quad-Core AMD Opteron Processor running Debian v3.2.4-1.

HiC normalization memory usage comparison
Memory usage was tested exactly as described under 1.4.3 HiC normalization runtime comparison, except using the maximum resident set size value (Supplemental Fig. 13). This analysis should be viewed with some skepticism: it is unclear how things such as garbage collection and transfer to the swap disk affect the reported RAM usage, so while these numbers may be relatively proportional, it is unclear how accurate some or all of the values are. Thus we urge caution in interpreting the memory usage data. That being said, it appears that HiCPipe is the most memory efficient, followed by HiFive (all but the Probability algorithm) and then HiCLib. HiFive's Probability algorithm appeared to use about an order of magnitude more RAM than other methods, although this is unsurprising given the modeling of all interactions. The most memory-intensive was HiCNorm, using twice as much RAM at its peak as HiFive's Probability algorithm.

Supplemental figure legends

A survey of strategies for assaying chromatin interactions.

Supplemental Figure 5 - 5C distance function. All non-zero interaction counts for mouse ES cells, replicate one, before and after fragment corrections are applied. Interactions were binned in a 200 by 200 grid for display. The red line shows the best-fit linear regression of log-transformed interaction counts as a function of log-transformed inter-fragment distances.

Supplemental Figure 6 - HiC distance function. The piecewise linear approximation of the distance-dependence relationship between HiC counts and interaction distances. The function calculated with numbers of reads is shown in black, while the function calculated using a binary indicator of observed/unobserved is shown in red. The graph to the right shows individual line segment approximations in alternating colors corresponding to the red box.

Supplemental Figure 7 - Differences in inter-dataset correlations between data normalized using a Poisson distribution for modeling noise and data normalized using a binomial distribution for modeling noise. a) Correlations across mutually-exclusive interaction size ranges for data binned at four different resolutions. b) Correlation differences for the entire set of intra- (cis) or inter-chromosomal (trans) interactions.

Differences in inter-dataset correlations between data normalized using all interaction size ranges and data normalized using only interactions larger than 500 Kb. a) Correlations across mutually-exclusive interaction size ranges for data binned at four different resolutions. b) Correlation differences for the entire set of intra- (cis) or inter-chromosomal (trans) interactions.

Differences in correlation of 5C-HiC data between 5C data normalized using all interaction sizes and data normalized using only interactions greater than 50 Kb.

HiFive normalization of 5C data and its correlation to corresponding HiC data normalized using HiCPipe. a) Correlation of 5C data (intra-regional only) with the same cell type and bin coordinates in HiC data for two different datasets, using each of HiFive's algorithms. b) Heatmaps for a select region from each dataset: un-normalized, normalized using HiFive's Probability algorithm, and the corresponding HiC data normalized using HiCPipe and dynamically binned.

Mouse data were normalized using HiFive-Express with 0, 1, 3, or 6 pseudo-counts added to each fend-feature bin in the binning model prior to normalization. Results for HiCPipe are also shown for comparison. a) Interactions were broken down into ten groups of non-overlapping cis interaction ranges for four resolution levels. b) Overall cis interaction correlations for each pseudo-count level across all bin size ranges. c) Overall trans interaction correlations for the larger two bin sizes. Trans interactions were not considered for bins smaller than 250 Kb.

Data analyzed using HiFive-Express either with or without the estimated distance-dependence signal removed prior to normalization. Interactions normalized using raw counts are shown in black, while interactions adjusted for the predicted distance-dependent signal prior to normalization are shown in red. a) Correlation of datasets across all interaction ranges for different binning resolutions, including intra- (cis) or inter-chromosomal (trans) interactions. b) Correlations between mouse HiC datasets produced using two different restriction enzymes, binned at four resolutions and subdivided into a series of ten interaction size ranges.