Bacon: a comprehensive computational benchmarking framework for evaluating targeted chromatin conformation capture-specific methodologies

Chromatin conformation capture (3C)-based technologies have enabled the accurate detection of topological genomic interactions, and the adoption of ChIP techniques to 3C-based protocols makes it possible to identify long-range interactions. To analyze these large and complex datasets, computational methods are undergoing rapid and expansive evolution. Thus, a thorough evaluation of these analytical pipelines is necessary to identify which commonly used algorithms and processing pipelines need to be improved. Here we present a comprehensive benchmark framework, Bacon, to evaluate the performance of several computational methods. Finally, we provide practical recommendations for users working with HiChIP and/or ChIA-PET analyses. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-021-02597-4.

generation sequencing, allowing the identification of long-range contacts bound by a transcription factor or chromatin mark of interest [17]. The recently developed longreads ChIA-PET [18] protocol provides better resolution and requires lower chromatin input than traditional ChIA-PET. HiChIP [19] and PLAC-seq [20] were both developed to improve the efficiency and sensitivity of ChIA-PET through the implementation of transposase-mediated library construction. However, it is a challenge to interpret targeted conformation capture data quantitatively, owing to the high prevalence of sequenced ligation junctions formed from uninformative close-range contacts. Moreover, the data analysis is complicated by the relative inefficiency of chromatin immunoprecipitation-based methods, which often lead to low library complexity. Therefore, robust and efficient computational methods are required to remove the biases associated with these molecular protocols and more accurately quantify chromatin contacts.
The ChIA-PET Tool [21] first proposed a general pipeline for processing of ChIA-PET data, including linker trimming, read alignment, loop detection, and significance estimation. The ChIA-PET Tool uses a hypergeometric (HG) distribution to count loops, and the HG model assumes that the random pairing chance of two anchor regions increases as the sequencing depth of the two anchor regions increases. Several subsequent methods were mostly based on this underlying procedure with only slight modifications. ChiaSig [22] improved the model by employing a non-central HG distribution, and the model added an additional factor, the distance between two anchor regions. The Model-based Interaction Calling from ChIA-PET (MICC) uses a Bayesian mixture model to systematically remove random ligation and random collision noise [23]. Another popular computational pipeline, mango [24], uses a binomial model to detect statistically significant interactions for ChIA-PET data, and also corrects for major sources of ChIA-PET data bias, including differential peak enrichment and genomic proximity. ChIA-PET2 [25] took a Bayesian mixture model to provide a flexible pipeline for analyzing different types of ChIA-PET data, and it also supports allelespecific analyses. ChIA-PET Tool V3 [26] is an updated version of the ChIA-PET Tool, which processes short-read and long-read ChIA-PET data with multithreads. A recently developed method ChIAPoP [27] uses positive Poisson to distinguish the significant interactions from noisy ChIA-PET data. Hichipper [28] employs a background model to identify loops, which incorporates the effect of restriction enzyme site bias. MAPS [29] adopted a zero-truncated Poisson regression framework to explicitly remove the biases of HiChIP/PLAC-seq data, and then identifies the chromatin interactions by the normalized contact frequencies. FitHiChIP [30] leverages the non-uniform coverage and genomic distance scaling of contact counts to compute the significance of estimated loops. Also, HiCCUPS is a loop caller developed for Hi-C data, which also can be used to call HiChIP loops [31].
All the computational methods mentioned above are peak-based and tend to integrate the popular peak calling algorithm MACS2 [32] or similar pipelines to facilitate the positioning of loop anchors. However, given the protocol differences, many of the peak-based computational methods cannot be applied to ChIA-PET and HiChIP data simultaneously. Accordingly, cluster-based methods were developed to fit both types of datasets. cLoops [33] was based on the clustering algorithm cDBSCAN, which takes Paired-End Tags (PETs) and analyzes them directly by a permuted local background to estimate significance. A similar computational pipeline, CID [34], discovers chromatin interactions with an unbiased clustering approach that identifies loop anchors by splitting the PET groups iteratively. Recently, several benchmarking studies on Hi-C methods have been published [35][36][37]; however, computational benchmarking for targeted chromatin conformation capture-specific methodologies are lacking.
With this in mind, we present a comprehensive benchmark framework, Bacon, to evaluate the performance of targeted chromatin conformation capture-specific methodologies. Due to the intrinsic biases that exist for targeted conformation data, we systematically characterized the differences between two closely related technologies (ChIA-PET and HiChIP) and built the Bacon framework based on the established distinctions. In this study, we benchmarked 12 computational pipelines using 22 experimental datasets and 6 simulations. Finally, Bacon provides practical guidance for users and aids in the rational development of improved pipelines for developers.

Data characteristics of ChIA-PET and HiChIP
HiChIP and ChIA-PET both produce similar information about protein-specific topological interactions and are interpreted in much the same way. However, the two protocols differ greatly in how they capture this information. One important distinction between the two is that HiChIP incorporates restriction endonucleases to fragment the genome, while ChIA-PET traditionally relies upon sonication. And while this difference has been taken into account by some computational pipelines [28], a full characterization of the differences and similarities that exist between HiChIP and ChIA-PET is lacking. To characterize the read properties of HiChIP and ChIA-PET libraries, we compared their alignments with publicly available ChIP-seq data (for the processing of ChIP-seq data, see "Methods"). Principal component analysis (PCA) showed that the ChIP-seq, ChIA-PET, and HiChIP replicates clustered together into their respective experimental groups (Fig. 1A, B) (for PCA analysis, see "Methods"). Notably, mESC-Smc1 HiChIP data was heavily impacted by restriction enzyme treatment, as the read distribution of mESC-Smc1 HiChIP presented restriction enzyme cut site bias. Correspondingly, mESC-Smc1 HiChIP reads were sparse in positions without Mbol restriction sites (Additional file 1: Fig. S1A). To compare the bias of these two methodologies in a quantitative manner, we called peaks for 20 ChIP-seq, 10 ChIA-PET, and 12 HiChIP datasets separately, and then calculated the distribution of Mbol restriction enzyme motifs near peaks. We found the count distribution of restriction enzyme sites also showed more enzyme sites located within HiChIP peaks (Fig. 1C). Overall, 58.9% of HiChIP peaks overlapped with Mbol restriction enzyme sites ( Fig.  1D-F), which was much greater than the 22.9% observed for ChIA-PET.
An important control for evaluating HiChIP and ChIA-PET experiments is a favorable overlap with ChIP-seq data. We next performed differential analysis between 10 ChIA-PET, 12 HiChIP, and 20 ChIP-seq datasets (See "Methods"). A total of 5046 peaks that differed significantly between ChIA-PET and ChIP-seq were identified, and 3549 of these were ChIA-PET enriched (Fig. 1G). Conversely, 29,136 peaks differed between ChIP-seq and HiChIP datasets, among these, 26,222 were HiChIP-specific (Fig. 1H). We then looked at the peaks in each dataset had in common with ChIP-seq and found that 41.6% of ChIA-PET peaks overlapped with ChIP-seq peaks compared to 10% overlap of ChIP-seq with HiChIP peaks (Fig. 1I). Hence, HiChIP data generates more loops with elevated sensitivity compared to ChIA-PET; however, HiChIP produces data with a strong restriction enzyme site bias and a lower overall agreement with ChIP-seq data than ChIA-PET.
A benchmark framework for targeted chromatin conformation capture-specific methods In the current work, we developed Bacon, a computational benchmark framework that enables the characterization of the analysis steps for targeted chromatin conformation capture data, and the evaluation of the performance of different computational methods. Bacon addresses three fundamental processing steps for ChIA-PET and HiChIP datasets, including pre-processing, loop calling, and the detection of significant interactions ( Fig. 2A and Additional file 1: Fig. S2). Our framework also provides an evaluation of 12 different computational methods (Additional file 1: Table S1). Further, Bacon integrates 28 ChIA-PET and HiChIP datasets for testing (22 experimental, and 6 simulated datasets) ( Fig. 2B and Additional file 2: Table S6) and gathers gold standard interactions from the GEUVADIS Project [38], GTEx Project [39], CRISPRi perturbation screening [40], and ENCODE [41] for evaluating accuracy (Fig. 2C).
Bacon uses the Uniquely Valid Rate (UV Rate) (for the calculation of UV Rate, see "Methods") to evaluate the quality of wet-lab experiments, and the pre-processing effectiveness of each computational method. For loop calling, Bacon evaluates the reliability of anchors, as well as the accuracy of loops. The two state-of-the-art strategies to identify HiChIP and ChIA-PET loops are peak-based and cluster-based methods. In general, the peak-based methods start with peak calling by implementing MACS2 or other peak calling algorithms. Bacon utilizes peak co-occupancy (PC) (for the calculation of PC, see "Methods") to evaluate the reliability of anchors identified by the peakbased methods. Cluster-based methods typically use the read density or the distance within two paired-end tags (PETs) to identify loops. Bacon evaluates the enrichment levels of cluster-based anchors using an enrichment score (ES) (for the calculation of ES, see "Methods"). The accuracy of loops (ACC) is evaluated through the comparison of each output to a gold standard loop set (for the collection of gold standard loops and calculation of ACC, see "Methods").
To detect statistically significant genomic interactions, different strategies are applied by the current computational methods. Bacon validates the functionality of significant loops by calculating the activation rate (AR) (see "Methods"), which estimates the epigenetic functionality of loops through the incorporation of several active histone markers, such as H3K27ac, H3K4me1, and H3K4me3. Fig. 2 The schema of Bacon. A Overview of approach. The processing steps are connected by arrows, blue squares indicate the categories of low-quality PETs to be filtered, and UV Rate, PC/ES, ACC, and AR are the evaluation metrics employed by Bacon to estimate the performance of different methods. B The testing datasets integrated by Bacon. C Schematic displaying how gold standard loop sets were gathered to evaluate the accuracy of different methodologies

Evaluating the reliability of loop anchors
To evaluate the pre-processing steps required for any given method, Bacon calculates the UV Rate to estimate the percentage of valid PETs which uniquely map to the reference genome. Bacon considered 0 mismatched and 1 mismatched base pair during linker trimming for ChIA-PET2(CPT2) and ChIAPoP, mango incorporates a fixed setting, and for ChIA-PET Tool V3 (CPTv.3) Bacon used the default linker alignment score (Additional file 1: Table S2 and Fig. S3). The alignment score (MAPQ) was set to 30 to filter out the high-quality PETs. Among all the ChIA-PET datasets we evaluated, CPTv.3 generated the highest UV Rate, followed by mango (Fig. 3A). For HiChIP analysis, only HiC-Pro and MAPS include data pre-processing, and MAPS achieved a higher UV Rate than HiC-Pro (Fig. 3B). HiC-Pro was developed to analyze Hi-C data and as part of its' two-step alignment process removes dumped pairs, dangling ends, and self-circles. This effectively decreases the total number of unique valid PET output from HiC-Pro, at least partially explaining its reduced UV Rate when compared to MAPS which incorporates these categories of pairs. Moreover, the UV Rate of HiChIP data ranged from 43.28 to 59.18%, which was greater than the UV Rate output from ChIA-PET data (3.23-22.95%) (Fig. 3C). These results suggest that the sensitivity of HiChIP is greater than ChIA-PET, which is consistent with previous studies [19,42].
Within the framework of Bacon, loops are called from different peak-based and cluster-based methods separately. In accordance with the results of Fig. 3C, the number of loops detected from HiChIP data was greater than for ChIA-PET (Fig. 3D). Among the peak-based methods we compared, MAPS supported the input of reference peaks (high-quality ChIP-seq peaks) to facilitate loop detection and remove noise. Utilizing a conserved strategy, MAPS retains not only the pairs with both ends overlapping a reference peak, but also the pairs with only one end overlapping. Hence, the number of loops called by MAPS ranked first among all the pipelines we evaluated. A HiChIPspecific software package known as Hichipper supports three types of inputs: reference peaks, only self-and dangling peaks, and all HiChIP peaks (Additional file 1: Fig. S4). We compared all the various settings for Hichipper and found that HiChIP (All peaks) and the combination of all replicates as input produced the greatest number of significant loops with this pipeline (Fig. 3D). Currently, there are very few cluster-based methods available. Only CID [34] and cLoops packages [33] are applicable for performing both HiChIP and ChIA-PET analysis, so we chose these to evaluate with Bacon. The results showed that both algorithms generated more loops when applied to HiChIP data compared to ChIA-PET datasets. The analysis of cluster-based results also indicated that CID produced more loops than cLoops (Additional file 1: Fig. S5).
To evaluate the reliability of peak-based loop anchors, Bacon calculates peak cooccupancy (PC) to estimate the overlapping percentage between loop anchors and ChIP-seq peaks. The results of PC analysis showed that ChIAPoP achieved the highest occupancy with published ChIP-seq peaks across all of the ChIA-PET analytical methods, and Hichipper(+chip) achieved the best among all of the HiChIP analytical methods followed by MAPS and FitHiChIP (Fig. 3E), since these three methods took ChIP-seq peaks as input to detect loops. In addition, peak intensity analysis of loop anchors showed that ChIAPoP and Hichipper(+chip) achieved higher peak intensity values than all the other methods (Additional file 1: Fig. S8). To investigate whether PC was impacted by the length of the anchor or the size of peak, we performed Pearson's correlation analysis between PC and the length of anchor/the size of peak; however, the results did not show significant correlations (Fig. 3F,G).
To estimate the reliability of cluster-based loop anchors, we developed Bacon to calculate the enrichment score (ES) for the loop anchors generated by CID and cLoops. ES was calculated via a site-by-site evaluation, which indicates whether the observed Bar graph displays the distribution of PET counts. The red line represents the average PET counts for the corresponding loop sets. E The peak co-occupancy (PC) of peak-based loop anchors. F Pearson's correlation between PC and anchor length. G Pearson's correlation between PC and peak size. R, correlation coefficient. p value calculated by t-test. H Global ES of two cluster-based methods in all datasets. *** p value < 1e−3, p value was calculated by t-test. I The overlap between mESC-Smc1 ChIP-seq peaks, peak-based anchors, and cluster-based anchors. J The genomic annotation of mESC-Smc1 ChIP-seq peaks, peak-based anchors, overlapping anchors, and cluster-specific anchors. All the error bars represent the calculated mean and standard error enriched site is significantly enriched, and the global ES reflects the average accuracy of the loops from the whole genome. Overall, we found that cLoops achieved a higher global ES than CID for the datasets we evaluated, indicating cLoops can detect more highly enriched (strong) loops; however, there still exists the possibility that CID is more sensitive in identifying true-weak loops ( Fig. 3H and Additional file 1: Fig. S6-S7).
To investigate the differences between peak-based anchors and cluster-based anchors, we combined the loop anchors detected by all the peak-based methods of mESC-Smc1 ChIA-PET/HiChIP datasets, and also combined the loop anchors detected by the cluster-based methods. The adjacent loop anchors within each set were merged and duplicates were removed. The overlapping results showed that 30.7% of peak-based anchors overlapped with ChIP-seq peaks, while only 7.6% of cluster-based anchors overlapped with ChIP-seq peaks, and 87.6% of the cluster-based anchors were specific (Fig. 3I). We next annotated the ChIP-seq peaks, the overlapping anchors derived from each of the three specific loop sets, and cluster-specific anchors. The annotations indicated that cluster-specific anchors were less enriched at promoters (Fig. 3J), which suggests that cluster-based methods detect more loops within non-coding regions in the datasets we analyzed.
To further investigate the properties of the different loops called by these methods, we chose ChIAPoP as the representative ChIA-PET peak-based method and compared the ChIAPoP-specific loops, cLoops-specific loops, and CID-specific loops with active and inactive histone mark ChIP-seq data. The results showed that ChIAPoP produces more active chromatin-enriched peaks with higher H3K27ac signal than cLoops and CID, while cLoops and CID output loops with more inactive H3K27me3-enriched profiles (Additional file 1: Fig. S9A) in the dataset we analyzed. For the analysis of HiChIP, we chose Hichipper(+chip) as representative, and the comparisons showed similar results to what was observed for ChIA-PET (Additional file 1: Fig. S9B).

Evaluating the accuracy of loops
To evaluate the loops generated by different methods in a quantitative manner, we gathered cell type-specific long-range contacts from our gold standard loop sets. To ensure the fairness of comparison, we generated three gold standard loop sets for each testing dataset (for the gathering of gold standard loop set, see "Methods"). Accuracy (ACC) was calculated for True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) metrics (for the calculation of ACC, see "Methods"), for better comparison and visualization, we re-scaled ACC from 0 to 1.
For the ACC evaluation results of ChIA-PET datasets, ChIAPoP outperformed the other ChIA-PET analytical methods with high scaled-ACC (> 0.95) in all datasets. For HiChIP methods, FitHiChIP and Hichipper(+chip) performed better than the others, which achieved the high scaled-ACC (> 0.95) in 8 and 7 datasets (Fig. 4) (for raw ACC, see Additional file 3: Table S7, and Additional file 4: Table S8).
Although there were more HiChIP loops than ChIA-PET loops (Fig. 3D), the ACC of loops was independent of the number of loops (Additional file 1: Fig. S10A). The ACC of ChIA-PET loops was higher than that of HiChIP loops across all the testing datasets (Additional file 1: Fig. S10B). To investigate what impacted the results of ACC, we calculated Pearson's correlation coefficient for ACC and three other evaluation metrics.
The results suggested that UV Rate barely correlated with ACC (Additional file 1: Fig.  S10C), while PC and ES were positively correlated with ACC (Additional file 1: Fig.  S10D and S10E).

Functionality of statistically significant loops
To remove noise and improve the accuracy of detected loops, we next wanted to apply a variety of statistical methods to the final loop outputs produced by each analytical method. Current analysis packages employ different strategies to identify the significance of loops, for example, ChiaSig facilitated non-central hypergeometric (NCHG) distribution [22], and mango employs corrected p values to account for multiple hypothesis testing [24]. Since CID can only call loops, as suggested by Guo et al. [34], we utilized the MICC tool [23] to identify the significance of CID loops. FitHiChIP provides two types of background model (loose L or stringent S) to correct biases, here represented by FitHiChIP-L and FitHiChIP-S.
To compare the significant loops fairly, we firstly counted the number of significant loops with at least 3 PETs from ChIA-PET, at least 8 PETs with HiChIP data, and then set a p value threshold of 0.05 (false discovery rate (FDR) of 0.05 if accessible for the Fig. 4 Evaluating the accuracy of loops. A, B Dot plot displaying the scaled accuracy of loops. A gray circle indicates that the corresponding method failed to detect loops. The dark squares indicate that the scaled accuracy is greater than 0.95 method). To determine the properties and to detect the functionality of these significant loops, we utilized candidate enhancer-like and promoter-like signatures from EN-CODE [41] to annotate loops. Next, H3K27ac, H3K4me1, H3K4me3, and H3K27me3 ChIP-seq datasets were used to calculate the activation rate (AR) of enhancer-mediated loops (for the calculation of AR, see "Methods"), for better comparison and visualization, we re-scaled AR from 0 to 1. We then determined the AR of each individual method and found that ChIAPoP obtained the best AR in 12 ChIA-PET datasets, and Hichipper(+chip) obtained the highest AR for 8 of the HiChIP datasets, FitHiChIP-S achieved the highest AR for 5 of the HiChIP datasets (Fig. 5A, B).
To more rigorously assess the functional significance of the output loops from each analytical pipeline, we overlapped the detected loops with large-scale CRISPRi perturbation screening data obtained from K562 cells [40]. We found that ChIAPoP achieved the highest overlapping percentage in ChIA-PET (Fig. 5C), and Hichipper(+chip) performed best in HiChIP datasets, followed by FitHiChIP-S (Fig. 5D). Further, we interrogated the well-studied MYC locus with K562-H3K27ac ChIA-PET data, and we found that most methods detected contacts that overlapped with a set of previously validated MYC enhancers [43]. Notably, ChIAPoP loops contacted more CRISPRi MYC enhancers than the other methods, such as MYC-enhancer5, MYC-enhancer6, and MYC-enhancer7 (Fig. 5E). Secondly, we evaluated K562-H3K27ac HiChIP data against another set of CRISPRi validated loops [40], and we found that there were four methods detecting loops near the SEMA7A locus; however, only Hichipper(+chip) identified loops between this CRIPSRi-validated distal enhancer and the SEMA7A promoter region (Fig. 5F).

Discussion
The emergence of 3C-based techniques has enabled the accurate detection of 3D genomic interactions. Importantly, 3D genomic contacts are highly dynamic given the variability of chromosome structure [44,45]. In addition, 3C protocols take an average view of the chromatin interactions from a population of cells, and the limitations of penetrance may lead to the low availability of appreciable contacts, which hinders the interpretation of biological function. The adoption of ChIP techniques makes it possible to identify rare interactions mediated by a protein of interest, which are often undetectable by other 3C-based methods. Currently, it is a challenge to interpret targeted conformation capture data quantitatively. Moreover, low levels of ChIP enrichment often reduce the complexity of HiChIP and ChIA-PET libraries. Given these challenges, we developed Bacon, a benchmark framework to facilitate the comparison of computational methods and provide practical guidance for users and suggestions for the rational design of new analytical pipelines.
To provide practical guidance for users, we considered different conditions, different analysis strategies, and the performance of each tool (Fig. 6). The mean UV Rate, PC/ ES, scaled-ACC, and scaled-AR in all datasets were calculated and then annotated based on how well they performed across multiple data sets. We also recorded the running time for each loop calling method across different datasets (Additional file 1: Table S4-S5). For ChIA-PET analysis, ChIAPoP outperformed the others in reliability, accuracy and detecting activation; however, ChIAPoP required more running time than the other peak-based methods, and cannot be applied to datasets generated by long-read ChIA-PET protocols. For the HiChIP analytical methods, FitHiChIP-S and Hichipper(+chip) outperformed the others in PC, ACC, AR, and running time. However, both FitHiChIP and Hichipper only accept the valid pairs output from HiC-Pro to call loops, so if users want to perform the analysis procedure without switching methods, then MAPS is the only choice. We noticed that although cluster-based methods (CID and cLoops) can be applied to both ChIA-PET and HiChIP datasets, the ACC and AR metrics did not stand out, and these two methods also took more running time than the peak-based methods.
Here we provided several suggestions for the future development analytical pipelines for both HiChIP and ChIA-PET analysis. Importantly, more flexible parameters should be considered, such that future chemistries, sequencing read lengths, and other experiment-specific factors can be appropriately accounted for. We found that the methods we analyzed integrated different alignment tools, and the key mapping parameters were fixed by most pipelines. However, for different lengths of raw sequencing data, the alignment settings should be adjustable to achieve optimal results. Additionally, more reasonable self-ligation cutoffs should be set, such that the reads being input for loop calling are completely valid. Selfligation PETs are filtered out prior to calling loops, and the cutoff between a read PET being designated as a self-ligation product versus inter-ligation product ranges from 5 to 12 kb [21,33]. While most methods simply set the cutoff as a fixed value or asked users to set it themselves, the cutoff should be defined in a more rational way, such as being based on the distribution model of PET lengths. What is more, the available replicates should be rationally used to correct background and enhance the results, such as Hichipper employed a combinatory strategy to merge all the replicates to detect loops, which achieved the highest reproducibility in all HiChIP datasets (Additional file 1: Fig. S11). The recently developed long-read ChIA-PET [18] protocols provide better resolution and require lower chromatin input than ChIA-PET. The only publicly available longread ChIA-PET data is from Tang et al. [46] which does not match any available HiChIP conditions. Thus, we were only able to compare long-read ChIA-PET data with standard ChIA-PET data. Within all the ChIA-PET analytical methods, only CPT2 [25] and CPTv.3 [26] supported the raw data analysis of long-read ChIA-PET data; thus, we compared the performance of the latest approach ChIA-PIPE [47] with CPT2 and CPTv.3. The results showed that ChIA-PIPE performed slightly better than CPT2 and CPTv.3 in all the long-read ChIA-PET datasets (Additional file 1: Table S3). Overall, there still lacks more powerful tools suitable for the long-read ChIA-PET protocol.
Finally, we believe that the next generation of computational tools for analyzing protein-directed 3D chromatin topology should boost sensitivity to account for weak or dynamic interactions. Inherited from the processing pipelines of ChIP-based techniques, peak calling was the most frequently used strategy to detect the enrichment of PETs. Since peak calling results rely highly on ChIP enrichment, some weak or dynamic interactions are likely to go undetected [34]. Cluster-based methods rely on the relative read densities rather than the absolute values, making more locally enriched loops standout compared to peak-based methodologies. Thus, cluster-based methods offer a viable solution to detect more dynamic topological interactions [34]. However, this sensitivity comes at the expense accuracy. How to balance the detection of dynamic loops while maintaining the accurate detection of strong loops remains a challenge which needs to be addressed. A rational approach to the development of such an analytical pipeline would be a mixed model, where the accuracy of the top-performing peak-based approaches and the sensitivity of the cluster-based pipelines are combined. A number of considerations have to be made for such a mixed model, but ultimately it needs to limit noise, boost signal, and output an accurate representation of the factorenriched 3D genome. Firstly, noise has to be reduced. Both ChIAPoP and Hichipper(+ chip) reduced noise more effectively than other pipelines. ChIAPoP cleverly incorporates into their model the use of chimeric reads, in addition to sequencing depth and inter-anchor distance, for estimating noise. And the addition of ChIP-seq data to Hichipper also dramatically improves noise reduction. Both strategies incorporated together within a statistical framework could theoretically aid in the noise reduction necessary for the generation of a mixed peak-and cluster-based model.

Conclusion
Currently, many computational methods and packages are available to analyze HiChIP and ChIA-PET datasets. However, it is challenging to compare the performance of these different computational pipelines without the use of uniform gold standard datasets and evaluation metrics. With this in mind, we developed a comprehensive benchmark framework, Bacon, to evaluate the performance of different computational methods and provide practical recommendations to fit different analysis requirements. We investigated the diverse characteristics of ChIA-PET and HiChIP datasets, and deployed Bacon to benchmark 12 computational methods comprehensively. The evaluation results indicated that ChIAPoP outperformed the others in reliability and accuracy for ChIA-PET analysis, while FitHiChIP-S and Hichipper(+chip) outperformed the other available methods for HiChIP analysis. However, these methods still presented different deficiencies, and no single method stood out in every analytical aspect. Overall, the lessons learned from our evaluation of these analytical tools can be leveraged to improve the design of future computational pipelines.

Calculation of Uniquely mapped Valid Rate (UV Rate)
Different linker trimming strategies and alignment tools were used by different ChIA-PET and HiChIP analytical methods to pre-process the raw reads. In the preprocessing step of Bacon, we tried 0 and 1 mismatch for ChIAPoP and ChIA-PET2. And for ChIA-PETv.3, we used the default linker alignment score to trim linkers. For alignment, the minimum mapping rate (MAPQ) was set 30, and the duplicates were filtered by Picard. The Uniquely mapped Valid (UV) PETs were retained, and UV Rate was defined as the number of UV PETs divided by the number of total PETs. Although we set the same filtering threshold for different methods, the different fixed settings specific to each method impacted the UV Rate.
The GEUVADIS eQTL data was downloaded from ftp://ftp.ebi.ac.uk/pub/databases/ microarray/data/ex periment/GEUV/E-GEUV-1/analysis_results/. The GTEx eQTL data of transformed lymphocytes was downloaded from https://www.gtexportal.org/ home/datasets. For the variants and gene pairs of eQTLs data, we extracted the genomic loci of variants then extended to 5 kb length from both ends and extracted the genomic coordinates of genes from GENCODE v19 annotation. The verified Enhancergene pairs in K562 with CRISPR/Cas9 perturbations were downloaded from the study of Gasperini et al. [40].

Generation of Hi-C loops with strong signals
For the cell lines of K562, GM12878, mESC, MCF7, HeLa-S3, and AML12, the compressed binary (.hic) files were downloaded from https://bcm.app.box.com/v/aidenlab/ folder/11234760671, and GSE134621 [64]. Then HiCCUPS [31] was used to call loops for .hic files with the resolution of 1 Mb, and KR (Knight-Ruiz) balancing mode was used to normalize the data, the loops with strong signals were defined as FDR ≤ 0.01 and observed counts ≥ 5. For BMDMs, erythroblast, and PAEC, the Hi-C interactions were downloaded from GEO under accession number of GSE109965 [65], GSE168176 [66], and GSE152900, then we filtered the interactions with a threshold of 5.

Calculation of accuracy (ACC)
To evaluate the accuracy of different computational methods, we regarded gold standard loops as "true" loops, the "false" loops were constructed with the same number as "true" loops. We firstly constructed the potential false loops set by selecting the genomic location avoiding TSS regions of genes and enhancer-like regions from ENCODE, which excluded all potential enhancer-promoter loops. Then we excluded the eQTL containing loops, CRISPR/Cas9 validated loops, and strong Hi-C loop signals from the potential false loops set. Since the number of potential false loops was much greater than the "true" loops, we then randomly selected the same number of false loops from the potential loop set. To eliminate the effect of randomness, we repeated the selection three times, thus we had 3 "false loops" files for each true loops file. Then we defined true positive (TP) as the number of detected loops which intersected with gold standard loops, false positive (FP) as the number of detected loops intersected with "false" loops, true negative (TN) as the number of "false" loops NOT intersected with detected loops, and false negative (FN) as the number of gold standard loops NOT intersected with detected loops. Since there were three "false" loop files for each true loop file, we calculated the mean of three FPs as the final FP value to calculate accuracy. The accuracy (ACC) was calculated as follows:

Simulation of ChIA-PET and HiChIP data
To test the performance of computational methods in the simulation datasets with different ChIP antibodies, we generated anchor pairs for ChIA-PET and HiChIP, according to the different data characteristics between these two types of data. For ChIA-PET anchor pairs, a Poisson model was used as described previously [22]. Firstly, the chromosomes were segmented into bins of 5000 bp, the loci of segmented points were recorded and used to construct a potential contact pairs matrix, then we simulated the interaction between genomic loci i and j as n ij~P ois(λ ij ), the expectation λ ij ¼ a 1þδ , with δ = | i − j|, a represented the number of anchors used for simulation, and the interaction frequencies depend on the genomic distance strongly. The HiChIP anchor pairs were generated with Hi-C data as described by Bhattacharyya [30]. Firstly, we set the bin size (5000 bp) to extract Hi-C pairs as HiChIP anchor pairs. Secondly, the published ChIP-seq data publicly available for each antibody was used to simulate the ChIP enrichment, we then calculated ChIP-seq read coverage for the generated anchor pairs. Next, we set the coverage threshold as 50% to filter the final simulated valid pairs. Overall, we used three different ChIP antibodies to simulate the ChIA-PET and HiChIP data. Finally, we then tested the performances of different methods with these simulated datasets.

Calculation of peak co-occupancy (PC)
The uniquely mapped alignment files of ChIA-PET and HiChIP were obtained in the pre-processing step of Bacon, which were used to call loops by different computational methods. Bacon extracted the anchors from called loops, then the anchors with a minimum 90% of length overlapped with each other were merged. The public ChIP-seq peak was used as target set to detect overlaps with different anchor sets, all the anchors that had at least 1 bp overlap with the ChIP-seq peaks were gathered into a candidate set A, the total number of anchors in loop set was represented by N, and the number of anchors in A was represented by N A . The length of anchor i was represented by L(a i ).The peak overlapped with anchor i was P i , and the length of this peak was L(p i ), overlapping length between anchor i and ChIP-seq peak was represented by L(o i ). If there were more than one peak overlapped with the anchor, the peak with longest overlapping length was selected. The PC was calculated as follows: PCA and differential analysis for ChIP-seq, ChIA-PET and HiChIP Given that ChIP-seq, HiChIP, and ChIA-PET all rely on chromatin immunoprecipitation, they should display similar binding profiles when comparisons are being made for a single protein of interest. We implemented DiffBind [69] to perform PCA analysis and to give a deeper insight into how these experimental groups were associated. The alignment files derived from ChIP-seq, ChIA-PET, and HiChIP replicates were filtered using Samtools [70] with minimum MAPQ 30. Picard was then used to remove duplicates, and the filtered bam files were prepared as the input to function "dba," then a binding matrix with scores was calculated based on read counts for every sample with function "dba.count." The data were normalized based on sequencing depth with default setting of function "dba.normalize," then the function "dba.plotPCA" was used to see how well the samples cluster with one another. Before running the differential analysis, we used "DownsampleSam" function of Picard with "P = 0.2" to downsample the alignment files. "dba.contrast" function with default mode to model the data, as well as specify the comparisons we are interested in like ChIP-seq vs ChIA-PET and ChIP-seq vs HiChIP. Then "dba.analyze" function was used to perform the differential analysis, and the p value < 0.05 and log2(fold change) > 1 was used as the threshold of significance to detect differential peaks for one of the antibody. We repeated the differential analysis for ChIP-seq vs ChIA-PET and ChIPseq vs HiChIP on every antibody, then all the differential peaks were aggregated.

Calculation of enrichment score (ES)
ES was used to evaluate the enrichment level of loops identified by the cluster-based methods. Bacon defined ES on the assumption that the true enriched loci should be surrounded by relative low enriched PETs, which means the enrichment of anchor location was the local maximum. The genomic coordinates of two anchors were represented by s1, e1, s2, and e2, and Bacon firstly calculated the average length of two anchors, l m ¼ l a þl b 2 , in which l a = e1 − s1, l b = e2 − s2. P x was defined as the number of PETs in genomic region x. The enriched number of PETs within the neighbor region of loop i was represented by P i n .
The PET count of loop i was defined as C i , and ES of loop i was calculated as follows: If the value of ES i less than 1, indicating the enrichment of neighbor region was higher than the enrichment of loop anchor, loop i was thought to be an invalid loop. Thus, the higher the ES i , the more reliable the loop.
To estimate the enrichment level of all the loops in a genome-wide fashion, Bacon calculated the global ES, in which, ∝ is a coefficient to adjust the enrichment of low PET coverage regions caused by uneven sequencing depth. The PET coverage of whole genome was calculated by C = LN/G, C is for coverage, G is the length of genome, L is the length of PET, N is the number of PETs, C j is the PET coverage of j chromosome, and ∝ is calculated by ∝ ¼ C C j .

Estimating resolution level of loops
We firstly calculated the distance between two anchor regions, and the loops were segregated into three types according to the range of distance d: d ≤ 10 kb, 10 kb < d ≤ 100 kb, and 100 kb < d ≤ 1 Mb. The number of loops in each type divided by the total number of loops was the resolution level in this range, and the resolution level was then plotted as heatmap.
Activation rate (AR) of significant loops The significant loops were firstly filtered with at least 3 PETs of ChIA-PET data, at least 8 PETs of HiChIP data, and the p value threshold was set as 0.05, false discovery rate (FDR) threshold was 0.05 if accessible for the method. The candidate enhancerlike and promoter-like signatures files were downloaded from ENCODE, and the anchors of filtered loops were extracted to overlay with the enhancer-like and promoter-like elements, then the percentages of E-E/E-P/P-P loops were counted. Then the ChIP-seq peaks of active histone markers (H3K27ac, H3K4me1, H3K4me3), ATAC-seq peaks, and repressive histone marker (H3K27me3) were collected to overlap with the E-E/E-P/P-P loops. The overlapping length between anchors and peaks were calculated by Bedtools [71]; if the overlapping length of activate peaks is larger than that of inactive peaks, the loop was thought to be active. Otherwise, the loop was thought to be inactive. If there is no active or repressive peak overlying, or the active overlapping length is equal to the inactive overlapping length, the loop was classified to other type. The percentage of active loops was defined as activation rate (AR). The annotation of loops and the calculation of AR were implemented by homemade scripts (see Bacon webpage).
Additional file 2: Table S6. Statistics of experimental datasets and simulation datasets.
Additional file 3: Table S7. Evaluation results of ACC in ChIA-PET datasets.
Additional file 4: Table S8. Evaluation results of ACC in HiChIP datasets.
Additional file 5. Review history.

Review history
The review history is available as Additional file 5.

Peer review information
Wenjing She was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.