Overview of HiNT
HiNT has three main components. HiNT-PRE performs preprocessing of Hi-C data and computes the contact matrix, which stores contact frequencies between any two genomic loci. HiNT-CNV and HiNT-TL start with a Hi-C contact matrix and predict copy number segments and interchromosomal translocations, respectively (Additional file 1: Fig. S2).
HiNT-PRE aligns read pairs to the genome using BWA-MEM [15] and creates a Hi-C contact matrix. The matrix is constructed from normal read pairs (non-chimeric reads that map uniquely to the genome) as well as unambiguous chimeras [16] (Fig. 1a). The latter is a product of Hi-C ligation and is defined as a read pair in which one chimeric read is split into locus A and locus B and the other read is uniquely mapped to locus B (Fig. 1a). All other read pairs containing split reads are defined as ambiguous chimeras [16], which will be used for translocation breakpoint detection (Fig. 1a).
HiNT-CNV (Additional file 1: Fig. S2) first creates a one-dimensional (1D) coverage profile across the genome by calculating row or column sums of the contact matrix at a fixed resolution, e.g., 50 kb. These sums should be correlated with the copy number across the bins since they correspond to the strength of interaction of that region with all other regions. It is critical to use the unnormalized contact matrix here because the matrix-balancing normalization (setting the sum of each row or column to be 1), which is the most widely used Hi-C normalization approach, removes not only biases but also copy number information. The next step is to perform further adjustment to remove other biases that are inherent in the Hi-C experiments, such as GC content, mappability, and restriction site frequency. In Fig. 1b, we see that, without additional adjustment, the 1D profiles for K562 (human chronic myelogenous leukemia cell line; known to have high genomic instability) and GM12878 (human lymphoblastoid cell line) show similarity to each other but not with the copy number profiles estimated from WGS. However, when we remove Hi-C internal biases in K562 by using GM12878 as a control (Fig. 1b, right), the 1D coverage profile becomes highly correlated with the (ploidy-adjusted) copy ratios estimated from WGS data. This result shows that proper normalization is essential in extracting copy number information from Hi-C data. Given that an appropriate control is often unavailable, HiNT-CNV uses a generalized additive model to remove the biggest sources of bias: GC content, mappability, and restriction fragment length (see Methods) [17, 18]. The boundaries of CNV segments are determined using the BIC-seq segmentation algorithm, which utilizes the Bayesian information criterion to identify regions with enriched or depleted read counts [19]. We used the latest version BIC-seq2 [20] that does not require a matched control. It is important to tune the parameter λ in BIC-seq2 to achieve the desired level of smoothness in the CNV profile. Other CNV segmentation algorithms may also be substituted in place of BIC-seq2.
HiNT-TL (Additional file 1: Fig. S2) detects translocations by analyzing normalized interchromosomal interaction matrices. In general, contact probabilities between two regions on the same chromosome decrease monotonically with distance, and interchromosomal interactions are considerably less frequent compared to intra-chromosomal ones. When an interchromosomal translocation occurs, we expect the contact probabilities in two opposite quadrants around the breakpoint to be elevated to the levels observed for adjacent chromosomal regions (Fig. 1c). Thus, HiNT-TL identifies candidate translocated chromosomal pairs based on the presence of high contact probabilities and their unequal distribution. To identify exact breakpoints, HiNT-TL first identifies the breakpoint regions with a coarse 100-kb resolution from the 1D profiles (see “Methods”). HiNT-TL then uses Hi-C ambiguous chimeric reads located within these regions to refine breakpoints to single base-pair resolution.
CNVs predicted by HiNT from Hi-C are consistent with those identified from WGS
To predict CNVs, we first calculate the coverage profile throughout the genome at 50 kb resolution. We then correct for Hi-C biases such as GC content, mappability, and the number of restriction sites (given a fixed bin size, the number of expected fragments depends on the number of cut sites by the restriction enzyme used). To model the non-linear correlation between 1D coverage and biases observed (Additional file 1: Fig. S3), we use a generalized additive model (GAM) with the Poisson link function. GAM is an ideal framework here, as it allows non-parametric fitting with relaxed assumptions on the relationship between predictor and response variables. The copy number information is extracted from regression residuals by the following equation:
$$ \log \left(\mathrm{Coverage}\right)={s}_1\left(\mathrm{GCcontent}\right)+{s}_2\left(\mathrm{Mappability}\right)+{s}_3\left(\mathrm{NumberOfRestr}.\mathrm{Sites}\right)+\varepsilon $$
where si (i = 1, 2, 3)(•) is an unspecified function estimated for each predictor variable and ε is the residual. The model fits better for GM12878 (R2 = 0.798) than for K562 (R2 = 0.631), since K562 is known to have more SVs.
To evaluate CNVs identified from Hi-C, we compare the log2 copy ratios along the genome from the model above with those estimated from WGS. For K562, we see that copy number alterations are prevalent and that the log ratios from Hi-C and WGS are mostly concordant (Fig. 2a, Additional file 1: Fig. S4A; Spearman correlation = 0.82). For GM12878, the correlation is lower (Spearman correlation = 0.21) because there are very few CNVs in this cell line, and the existing small ones are detected only from WGS (Additional file 1: Fig. S4B, Additional file 1: Fig. S5A). The copy ratios fluctuate more in the Hi-C profile relative to WGS data (Fig. 2a, Additional file 1: Fig. S5A) due to the different read depth and possibly due to Hi-C biases that may not have been captured by our model. When the copy number log ratios are segmented using BIC-seq [19], the concordance between the platforms is striking (rows 2 and 3 in Fig. 2b), with ~ 85% and 92% of the large (> 2 Mb) segments from Hi-C overlapping those from WGS in K562 and GM12878, respectively (Fig. 2c, Additional file 1: Fig. S5D; our definition of overlap is described in Additional file 1: Fig. S5C). The copy number profile from array comparative genomic hybridization (CGH) data obtained from Zhou et al. [21] is also mostly concordant (row 1 in Fig. 2b).
To ensure that our results are generalizable, we applied HiNT-CNV to five more cell lines: Caki2 (human renal cancer cell line), LNCaP (human prostate adenocarcinoma cell line), MCF7 (human breast cancer cell line), PANC-1 (human pancreatic cancer cell line), and CHM13hTERT (an effectively haploid cell line, abbreviated as CHM13). Our results show that copy number profiles estimated by HiNT agree well with those inferred from WGS, with a Spearman correlation of ~ 0.8 (Additional file 1: Fig. S6-10A) in most cells. The low correlation (0.4) in LNCaP cells may due to the poor quality of its Hi-C data [22]. More than 80% of large CNV segments identified by HiNT are supported by those identified from WGS in most cell lines (exact overlaps are described in the figure legends of Additional file 1: Fig. S6-10B,C). Collectively, our analysis suggests that HiNT is a reliable tool for identifying large-scale CNVs in both cancer and normal Hi-C data.
HiNT outperforms HiCnv and OneD for identifying CNVs from Hi-C data
We compared the performance of HiNT to that of two other algorithms. HiCnv [23] infers copy number from normalized Hi-C coverage by employing kernel density smoothing followed by a hidden Markov Model; however, it also requires a baseline chromosome copy number from WGS or karyotyping to determine the true copy number of each chromosome. OneD [24] estimates copy number via a hidden Markov model on the corrected contact frequencies obtained from a generalized additive model. When we compare the copy number profiles generated by HiCnv and OneD to those derived from WGS, we find that they are largely discordant. The Spearman correlations of log2 copy ratios inferred from HiCnv and WGS are 0.67 in K562, 0.1 in GM12878, and 0.03 in CHM13 (Additional file 1: Fig. S4C-F, Additional file 1: Fig. S10A). Moreover, only 44.74%, 27.64%, and 70% of the large CNV segments detected by HiCnv overlap those identified from WGS in K562, GM12878, and CHM13, respectively (Fig. 2b, c, Additional file 1: Fig. S5B,D, Supp. Fig, 10B-C). The concordance between HiCnv and WGS is better in Caki2, LNCaP, MCF7, and PANC-1, but it is still less than that observed for HiNT (Additional file 1: Fig. S6–9; the exact correlations and overlaps have been labeled in the figures or figure legends). For OneD, the copy number log ratios are largely discordant with WGS in all cell lines except CHM13, with the correlation between 0.3–0.5 and only ~ 50% of the large CNV segments agreeing with those inferred from WGS (Fig. 2b, c, Additional file 1: Fig. S4E-F, Additional file 1: Fig. S5–10, the exact correlations and overlaps are in the figures or legends).
In addition, input to HiCnv must be either HiC-Pro [25] output or a SAM file, which is then converted to HiC-Pro format, incurring high computational cost for terabyte-scale datasets. For example, 3 billion read pairs result in a ~ 600 GB BAM file, and the required SAM format is at least fourfold larger than BAM format in size. In contrast, HiNT-PRE accepts FASTQ and BAM files and generates the Hi-C contact matrix in hic [16, 26] or cool [27] format, which serves as the input to HiNT-CNV. Both hic and cool are efficient and widely used formats for genomic interaction matrices. Taken together, HiNT-CNV outperforms these existing tools in detecting CNVs in both cancer and normal cell lines in both accuracy and usability.
HiNT accurately identifies translocated chromosomal pairs
Translocations modify the 3D organization of the genome, and they will be incorrectly identified as long-range interactions in Hi-C data if they are not accounted for properly. To first study their impact on Hi-C interaction maps, we developed a simulation scheme to recapitulate the effect of translocations, encompassing homozygous/heterozygous and balanced/unbalanced translocations. A balanced translocation is an even exchange of segments between chromosomes without genetic information gain or loss; an unbalanced translocation involves a loss or gain of chromosome segments. As observed in previous studies [23, 28, 29], a balanced translocation forms a “butterfly” appearance in the chromosomal interaction map (Fig. 3a and Fig. 3b middle, marked by red circles). In contrast, an unbalanced translocation only has a single block (Fig. 3a and Fig. 3b, right column, marked by red circles) [28]. Detection of intra-chromosomal translocations is complicated by the presence of chromatin structures such as TADs and loops. Therefore, we focus on identification of interchromosomal translocations.
Our method is based on detection of two characteristics. First, the contact frequencies should be distributed unevenly around the translocation breakpoint. For this, we utilize the Gini index, a statistical measure of distribution initially used to quantify income inequality in economics [30]. To compute this index, we estimate the cumulative distribution of contact frequencies in each square of the interaction map (we use 1 Mb × 1 Mb) and determine how much it deviates from a linear increase (see “Methods”). A high index corresponds to a more uneven distribution of interaction strength. Second, the maximum interaction level surrounding the breakpoint should be high for a translocation. Regions without a translocation but with a high noise level may satisfy the first criterion of uneven contact frequencies, but their maximum interaction level would not be large. Combining the two features (interaction level and evenness), we define the rank product score as RPi = (Rgini,i/n) ∗ (Rmif,i/n), where Rgini,i and Rmif,i are the ranks of matrix i based on Gini index and maximum interaction frequency, respectively, and n is the total number of interchromosomal interaction matrices.
The rank product score performs well in simulated data, separating the translocated and non-translocated cases in nearly all cases (Additional file 1: Fig. S11). For real data, we found that direct application of the rank product was insufficient, due to the various factors that are not captured by the normalization step, e.g., the A/B compartment effect and the increased interactions between small chromosomes or between sub-telomeric regions. To eliminate such biases, we created a background interaction matrix by averaging the matrices from five normal cell lines (Additional file 2: Table S1, see “Methods”) and used it to normalize the original matrix. In Fig. 4a, we show three examples of chromosomal pairs in K562 data whose scores change as a result of the additional normalization. In the first case (chr1- chr21), the score does not change significantly; in the second case (chr1- chr18), the score increases so that a translocation is now called; and in the third case (chr16 - chr19), the score decreases so that a mistaken call is avoided. Using the chromosomal pairs reported in the literature or validated by FISH experiments [4, 29] as true positives, we see that the adjusted matrix results in an increased prediction accuracy, as measured by the area under the curve (AUC) (Fig. 4b; see “Methods”). As visualized in Fig. 4c, the previously observed biases are effectively reduced by the normalization, allowing for better delineation of translocations (Additional file 1: Fig. S11, Additional file 1: Fig. S12A-D).
Although the rank product approach detects the majority of translocated chromosomal pairs, four validated translocations are not identified. To investigate this issue, we compare the Hi-C interaction matrices of the detected (Additional file 1: Fig. S13) and missed chromosomal pairs (Additional file 1: Fig. S14). Compared to the detected chromosomal pairs, no translocation signature can be visually detected from the interaction matrices for missed pairs. In addition, the sharp boundaries at translocation breakpoints on the 1D coverage profile can only be found in our predicted translocated chromosomal pairs. Thus, we believe that there are some translocated chromosomal pairs that are simply not reflected within Hi-C data, or the validation data may be incorrect, e.g., due to the variation among the K562 lines. We further examined four more cancer cell lines, including HelaS3 (cervical carcinoma), LNCaP, PANC-1, and T47D (breast cancer), for which FISH data were available for validation. We found that the rank product and the maximum interaction perform better than the Gini index in LNCaP, T47D, and PANC-1, whereas the rank product and Gini index are more predictive in HelaS3 (Additional file 1: Fig. S12E).
HiNT detects translocation breakpoints at single base-pair resolution using Hi-C chimeric reads
Once a chromosomal pair containing a translocation is identified based on the rank product, HiNT searches for the translocation breakpoint. For a translocation, the 1D row/column-sum profile should change abruptly at the breakpoint (Additional file 1: Fig. S13, and Additional file 1: Fig. S15A). To identify this point, we use a change point detection method called breakpoints from the R package strucchange [31], which adopts a linear model to detect one or several change points in multivariate time series. However, the majority of the change points detected by breakpoints are the result of lower mappability and unremoved compartment effects and thus should not be identified as the translocation breakpoints (Additional file 1: Fig. S15A). To remove these false positives, we impose a filtering step in which only those with one quadrant (unbalanced translocation) or two diagonally opposite quadrants (balanced translocation) around the candidate breakpoint have very high interactions (Additional file 1: Fig. S15, “Methods”). Here, we define a high interaction frequency as being greater than the 99th percentile of all the interactions between the two chromosomes.
Next, we determine the precise coordinates of the breakpoints by using ambiguous chimeric reads [16] (Fig. 1a). These reads have their primary alignment near a breakpoint in one chromosome (e.g., chrA) and their clipped part align near a breakpoint in another chromosome (e.g., chrB). HiNT provides the intervals in which the breakpoints occur (100 kb resolution) and, as long as the breakpoint does not occur in an unmappable region, the exact location of the breakpoint (1 bp resolution).
Hi-C can supplement WGS by locating translocation breakpoints in repetitive regions
To assess its performance, we compare the translocation breakpoints determined from Hi-C using HiNT with those detected from WGS using Delly [32] and Meerkat [33]. In K562, 89 and 135 interchromosomal translocations are detected by Meerkat and Delly (see “Methods”), respectively, with only 20 translocations detected by both (Fig. 5a, Additional file 3: Table S2). This level of discrepancy is not unexpected [34] and is indicative of the difficulty of detecting SVs in general. When we intersect these 20 consensus WGS-based translocations with those detected by HiNT, we find that 5 are in common (Fig. 5a). Two additional ones were found by HiNT and either Meerkat or Delly but not both. In these 7 cases, the breakpoints were exactly the same at the nucleotide level, confirming the accuracy of the calls (Additional file 4: Table S3). An example is a translocation between chromosome 9 and 22 shown in Fig. 5b, with more than 100 supporting clipped reads in Hi-C data and many discordant reads in WGS data (Fig. 5c).
Thirty-three translocations are detected only from Hi-C data (Fig. 5a; listed in Additional file 5: Table S4). For example, a significant rank product score is found between chr3 and chr18 in the Hi-C interaction matrix (Fig. 5d), and three breakpoint regions are detected by HiNT including one validated by FISH [29] (Additional file 6: Table S5). However, few discordant reads are identified from WGS. A major reason for this difference is the low mappability around those breakpoints. As illustrated in Supplementary Figure 1, the long physical distance between Hi-C read pairs allow identification of translocations whose breakpoints occur in a repetitive region—the paired reads can “jump over” the repeat region and map to surrounding mappable regions, even though the breakpoint itself cannot be mapped. Indeed, we find that large repeat (> 1 kb) regions (as found in repBase [35]) make up a disproportionately large fraction of regions containing Hi-C-only breakpoints compared to WGS consensus breakpoints (Fig. 5e). We note that repetitive regions with high sequence divergence are mappable, but we used the term “repetitive region” for conceptual clarity.
For the translocations detected only in WGS, 6 out of 15 are missed in Hi-C simply because of the lower spatial resolution in Hi-C. Due to the nature of the assay, the coverage in the intergenic regions is especially sparse, regardless of the sequencing depth. As illustrated by two examples in Additional file 1: Fig. S16, when there is a translocation that turns out to be an insertion of a small segment from another chromosome, the Hi-C map does not show clear evidence (position indicated by a red dotted cross in the lower-left panel). When one zooms into that area, some interaction indicative of a translocated boundary is present; however, the interaction is too weak to be detected unless one lowers the detection criteria as to incur too many false positives. Two of the other nine cases appear to be complex SVs. In the two examples we show in Additional file 1: Fig. S17, discordant reads around the breakpoint are from two different chromosomes (indicated by different colors). Regardless of the exact details of the SV, it is clear that the Hi-C map (lower-left panel) does not capture the interactions; thus, HiNT cannot detect them. For the remaining seven cases, we believe these are false positive calls in WGS, often occurring in repetitive regions. We find that the discordant reads from WGS for these cases contain a large fraction of single nucleotide variants or have low mapping qualities, indicating issues in read alignment (Additional file 1: Fig. S18). Consistent with those being false positive WGS calls, no translocation-associated features are found in the Hi-C interaction maps. These analyses suggest that Hi-C is a powerful tool to detect translocations and can complement WGS, especially for detecting those with breakpoints in repetitive regions.
HiNT outperforms existing tools on detecting translocations
Others have attempted to identify structural variants from Hi-C data. One approach is simply to visually inspect the interaction heatmaps—a low-resolution detection of breakpoints with poor scalability and reproducibility [28]. Better approaches search for regions that contain abnormal interaction frequencies based on normalized Hi-C interaction maps [6, 36]. However, such methods utilizing only contact frequencies cannot easily distinguish translocations from chromatin interactions, thus giving a high false discovery rate (FDR). A recent algorithm HiCtrans [23] identifies translocation breakpoints based on change point statistics obtained by scanning the interchromosomal contact maps of each chromosomal pair. However, searching the breakpoints across all interchromosomal contact maps leads to a high computational cost. For a comprehensive set of inter- and intrachromosomal translocations, one could integrate WGS, Hi-C, and optical mapping data [29]. However, in most cases, it is impractical to generate all these data types for a given sample. The method they used for Hi-C data [29] is hic_breakfinder, an iterative approach to locate local clusters that deviate from the expected interaction frequencies in a Hi-C contact matrix.
To compare the performance of these algorithms, we first apply HiCtrans [23] and HiNT to simulation data. Hic_breakfinder [29] is not used here because it requires the aligned reads in BAM format, but our simulation is matrix-based. Of the 21 simulated interchromosomal translocations (mix of balanced/unbalanced and heterozygous/homozygous translocations), HiNT identified 20 correctly while calling an additional 5 breakpoints (Additional file 1: Fig. S19A). The one missing translocation was located at the centromere of chr21 (Additional file 1: Fig. S19B). In contrast, HiCtrans called 531 translocations (distributed across 100 different chromosomal pairs), but none were bona fide translocations (Additional file 1: Fig. S19C).
We also compared HiNT, HiCtrans [23], and hic_breakfinder [29] on the K562, LNCaP, PANC-1, and T47D data. As shown in Additional file 1: Fig. S19D-E, HiNT has the highest AUC measure in most cell lines (0.85 vs 0.78 and 0.77 in K562, 0.98 vs 0.96 and 0.93 in LNCaP, 0.84 vs 0.87 and 0.7 in PANC-1 and 0.97 vs 0.84 and 0.93 in T47D, see “Methods”) as well as the best precision-recall curve. Additionally, in K562, we found that while HiCtrans identified 132 translocated chromosomal pairs, which is more than half of the number of all chromosomal pairs, only 10 of them contain known translocations. Among all 931 breakpoints (~ 1 Mb resolution) identified by HiCtrans, only 2 of them cover what are detected from WGS by both Meerkat and Delly (Additional file 1: Fig. S19F). On the other hand, hic_breakfinder identified 77 breakpoints (~ 100 kb resolution). Among these breakpoints, 4 are identified by both Meerkat and Delly (Additional file 1: Fig. S19F). This suggests a higher false discovery rate of HiCtrans and hic_breakfinder than HiNT. Furthermore, we found that 60% (24/40) of HiNT-identified breakpoints can also be identified by other methods. In contrast, this value is only 35% (26/77) and 3.0% (27/931) for breakpoints output from hic_breakfinder and hictrans, respectively (Additional file 1: Fig. S19F). Collectively, HiNT-TL outperforms HiCtrans and hic_breakfinder in both specificity and accuracy.