Many genome-wide collections of candidate cis-regulatory elements (cCREs) have been defined using genomic and epigenomic data, but it remains a major challenge to connect these elements to their target genes.
To facilitate the development of computational methods for predicting target genes, we develop a Benchmark of candidate Enhancer-Gene Interactions (BENGI) by integrating the recently developed Registry of cCREs with experimentally derived genomic interactions. We use BENGI to test several published computational methods for linking enhancers with genes, including signal correlation and the TargetFinder and PEP supervised learning methods. We find that while TargetFinder is the best-performing method, it is only modestly better than a baseline distance method for most benchmark datasets when trained and tested with the same cell type and that TargetFinder often does not outperform the distance method when applied across cell types.
Our results suggest that current computational methods need to be improved and that BENGI presents a useful framework for method development and testing.
With the rapid increases in genomic and epigenomic data in recent years, our ability to annotate regulatory elements across the human genome and predict their activities in specific cell and tissue types has substantially improved. Widely used approaches integrate multiple epigenetic signals such as chromatin accessibility, histone marks, and transcribed RNAs [1,2,3,4,5,6,7] to define collections of regulatory elements that can be used to study regulatory programs in diverse cell types and dissect the genetic variations associated with human diseases [5, 8,9,10,11].
To maximize the utility of regulatory elements, one must know which genes they regulate. We recently developed the Registry of candidate cis-Regulatory elements (cCREs), a collection of candidate regulatory genomic regions in humans and mice, by integrating chromatin accessibility (DNase-seq) data and histone mark ChIP-seq data from hundreds of biosamples generated by the ENCODE Consortium (http://screen.encodeproject.org). Over 75% of these cCREs have enhancer-like signatures (high chromatin accessibility as measured by a high DNase-seq signal and a high level of the enhancer-specific histone mark H3K27ac) and are located distal (> 2 kb) to an annotated transcription start site (TSS). For cCREs proximal to a TSS, it may be safe to assume that the TSS corresponds to the target gene, but to annotate the biological function of the TSS-distal cCREs and interpret the genetic variants that they harbor, we need to determine which genes they regulate.
Assigning enhancers to target genes on a genome-wide scale remains a difficult task. While one could assign an enhancer to the closest gene using linear distance, there are many examples of enhancers skipping over nearby genes in favor of more distal targets . Experimental assays such as Hi-C and ChIA-PET survey physical interactions between genomic regions [13,14,15,16,17], and by overlapping the anchors of these interactions with annotated enhancers and promoters, we can infer regulatory connections. Approaches based on quantitative trait loci (QTL) associate genetic variants in intergenic regions with genes via the variation in their expression levels across multiple individuals in a human population [18, 19]. Recently, a single-cell perturbation approach extended this idea . However, these assays are expensive to perform and have only been conducted at a high resolution in a small number of cell types. Therefore, we need to rely on computational methods to broadly predict enhancer-gene interactions.
One popular computational method for identifying enhancer-gene interactions is to correlate genomic and epigenomic signals at enhancers and gene promoters across multiple biosamples. This method is based on the assumption that enhancers and genes tend to be active or inactive in the same cell types. The first study to utilize this method linked enhancers with genes by correlating active histone mark signals at enhancers with gene expression across nine cell types . Several groups subsequently used similar approaches to link enhancers and genes by correlating various combinations of DNase, histone mark, transcription factor, and gene expression data [8, 21,22,23]. While these methods successfully identified a subset of biologically relevant interactions, their performance has yet to be systematically evaluated.
Other groups have developed supervised machine-learning methods that train statistical models on sets of known enhancer-gene pairs. Most of these models use epigenomic signals (e.g., histone marks, TFs, DNase) at enhancers, promoters, or intervening windows as input features [24,25,26,27]. PEP-motif, on the other hand, uses sequence-based features . The performance of these methods has not been systematically evaluated for several reasons. First, different methods use different definitions for enhancers ranging from EP300 peaks  to chromatin segments . Second, these methods use different datasets to define their gold standards, such as ChIA-PET interactions [24, 26] or Hi-C loops [26, 27], along with different methods for generating negative pairs. Finally, many of these methods use a traditional randomized cross-validation scheme, which results in severe overfitting of some supervised models due to overlapping features [29, 30].
To facilitate the development of target gene prediction methods, we developed a collection of benchmark datasets by integrating the Registry of cCREs with experimentally derived genomic interactions. We then tested several published methods for linking enhancers with genes, including signal correlation and the supervised learning methods TargetFinder and PEP [27, 28]. Overall, we found that while TargetFinder was the best-performing method, it was only modestly better than a baseline distance method for most benchmark datasets when trained and tested on the same cell type, and Target Finder often did not outperform the distance method when applied across cell types. Our results suggest that current computational methods need to be improved and that our benchmark presents a useful framework for method development and testing.
A Benchmark of candidate Enhancer-Gene Interactions (BENGI)
To effectively evaluate target gene prediction methods, we curated a Benchmark of candidate Enhancer-Gene Interactions (BENGI) by integrating our predicted enhancers, cCREs with enhancer-like signatures (cCREs-ELS), with 3D chromatin interactions, genetic interactions, and CRISPR/dCAS9 perturbations in a total of 21 datasets across 13 biosamples (Fig. 1a, Additional file 1: Tables S1 and Additional file 2: Table S2a). For 3D chromatin interactions, which include ChIA-PET, Hi-C, and CHi-C interactions, we selected all links with one anchor overlapping a distal cCRE-ELS and the other anchor falling within 2 kb of a GENCODE-annotated TSS (Fig. 1b, see “Methods”). For approximately three quarters of the total interactions, the anchor of the 3D chromatin interaction overlaps the proximal region of more than one gene, making the assignment of the exact gene target ambiguous. To assess the impact of these potentially ambiguous assignments, we created two versions of each 3D interaction benchmark dataset. In the first, we retained all cCRE-gene links; in the second, we removed links with ends within 2 kb of the TSSs of multiple genes (i.e., ambiguous pairs). For genetic interactions (cis-eQTLs) and CRISPR/dCas9 perturbations (crisprQTLs), we paired a cCRE-ELS with a gene if the cCRE overlapped the reported SNP or targeted region (Fig. 1b). In total, we curated over 162,000 unique cCRE-gene pairs across the 13 biosamples. Because these experimental datasets capture different aspects of enhancer-gene interactions (see statistical analyses in the next section), we retained the cCRE-gene pairs as separate datasets in BENGI.
To complement the positive cCRE-gene pairs in each BENGI dataset, we generated negative pairs for each cCRE-ELS by selecting all unpaired genes whose TSS was located within (either upstream or downstream) the 95th percentile distance from all positive cCRE-gene pairs in the dataset (Additional file 2: Table S2a, see “Methods”). These distance cutoffs ranged from 120 kb (RNAPII ChIA-PET in HeLa) to 1.83 Mb (Hi-C in K562). The percentages of positive pairs also varied from 1.8% (Hi-C in K562) to 23.5% (CHi-C in GM12878), and datasets with greater class imbalance (i.e., a smaller percentage of positive pairs) are inherently more challenging for a computational algorithm. To enable the comparison of algorithm performance across datasets, we further created datasets with a fixed ratio of one positive to four negatives for each BENGI dataset by randomly discarding the excess negatives. This strategy, along with the previously mentioned removal of ambiguous 3D chromatin interaction pairs, resulted in four BENGI datasets per ChIA-PET, Hi-C, or CHi-C experiment and two BENGI datasets per eQTL or crisprQTL experiment (Fig. 1c, Additional file 2: Table S2a). All pairs with a natural positive-negative ratio were used in our analyses unless otherwise noted.
To facilitate the training and testing of supervised machine-learning algorithms, we then assigned both positive and negative pairs to 12 cross-validation (CV) groups by chromosome such that pairs within the same chromosome were always assigned to the same CV group, while similar sizes were maintained for different CV groups by pairing one large chromosome with one small chromosome (chromCV, see “Methods”, Fig. 1d). Because GM12878 and other lymphoblastoid cell lines (LCLs) had the most BENGI datasets and have been extensively surveyed by the ENCODE and 1000 Genomes Consortia, we will highlight our analyses on the BENGI datasets from LCLs.
Summary statistics of BENGI datasets
We asked whether the various types of chromatin, genetic, and CRISPR experiments might capture different types of enhancer-gene interactions. To answer this question, we carried out several statistical analyses across the BENGI datasets. First, we performed hierarchical clustering of the six BENGI datasets in GM12878/LCLs by the overlap coefficient—the number of positive cCRE-gene pairs shared between two datasets divided by the number of positives in the smaller dataset. We obtained two clusters: one comprising the two eQTL datasets and the other comprising the four chromatin interaction datasets (Fig. 2a). This overall grouping of the datasets was consistent with the characteristics of the experimental techniques (Table 1). Beyond the overall grouping, the two eQTL datasets exhibited higher overlap coefficients with the RNAPII ChIA-PET and CHi-C datasets (0.20–0.36) than with the Hi-C and CTCF ChIA-PET datasets (0.01–0.05). This reflects the promoter emphasis of the first four techniques, enriching for promoter-proximal interactions. In contrast, Hi-C identifies significantly more distant interactions than the other techniques (Fig. 2b, Additional file 3: Figure S1a, Wilcoxon rank-sum test p value = 1.1E−223). Additionally, we note that the eQTL and crisprQTL interactions all have maximum distances of 1 Mb (Additional file 3: Figure S1a) because the original studies only tested SNPs within 1 Mb of each gene.
We then compared the gene expression of the positive pairs among the six GM12878/LCL datasets (Fig. 2c). Overall, the genes in the GEUVADIS eQTL pairs exhibited the highest median expression (median = 10.9 transcripts per million sequenced reads, or TPM; Wilcoxon rank-sum test p = 1E−3), while the genes in the CHi-C pairs presented the lowest median expression levels (median = 0.24 TPM, p = 7E− 39). When we removed ambiguous pairs, gene expression increased significantly for all four chromatin interaction datasets (Additional file 3: Figure S1b), suggesting that some of the ambiguous pairs were false positives. We observed similar increases in gene expression upon the removal of ambiguous pairs in other cell types for which we had RNA-seq data (Additional file 3: Figure S1c-e). Without the ambiguous pairs, the RNAPII ChIA-PET pairs showed comparable expression to the GEUVADIS eQTL pairs. The enrichment for RNAPII in the ChIA-PET protocol may preferentially identify interactions that involve higher RNAPII activity and higher gene expression. The K562 crisprQTL pairs presented the highest overall median expression of 26.4 TPM. We expected to observe high expression for the eQTL and crisprQTL datasets because these interactions can only be detected for genes that are expressed in the respective biosamples.
We also observed significant differences in the CTCF ChIP-seq signals at cCREs-ELS between the BENGI datasets: cCREs-ELS in CTCF ChIA-PET pairs and Hi-C pairs showed significantly higher CTCF signals than cCREs-ELS in the other datasets (Wilcoxon rank-sum test p < 3.7E− 9, Fig. 2d, Additional file 2: Table S2b). Similarly, these pairs were enriched for components of the cohesin complex such as RAD21 and SMC3 (Additional file 2: Table S2b). This enrichment for CTCF was biologically consistent, as CTCF was the target in the ChIA-PET experiment, and Hi-C loops are enriched for convergent CTCF binding sites .
Finally, we tallied the number of linked genes for each cCRE-ELS. Across all BENGI datasets, the majority of cCREs-ELS were linked to just one target gene (Fig. 2e, Additional file 2: Table S2c). As expected, this trend was more pronounced for 3D chromatin datasets without ambiguous pairs (on average, 84% of cCREs-ELS were paired with only one gene, p < 3.3E−5). With or without ambiguous pairs, a lower percentage of cCREs-ELS in CHi-C pairs was paired with just one gene (19% of all pairs and 55% of unambiguous pairs) than in the other BENGI datasets (p < 3.1E− 75). This observation, along with the lower average expression of the linked genes (Fig. 2c), suggests that some of the CHi-C pairs were either false positives or captured interactions between cCREs-ELS and genes that are yet to be expressed.
These analyses suggested that the various experimental techniques whose results formed the basis of the BENGI datasets capture different classes of genomic interactions. Because we do not have a complete understanding of which experimental techniques are best able to capture bona fide enhancer-gene interactions, we propose that computational methods (Table 2) should be evaluated on the entire collection of these BENGI datasets to provide a comprehensive understanding of their performance.
A baseline method of target gene prediction using genomic distance
Using the BENGI datasets, we evaluated a simple closest gene method for target gene prediction: a cCRE-ELS was assigned to its closest gene in terms of linear distance, computed by subtracting the genomic coordinates of the cCRE and the nearest TSS. All BENGI datasets, despite interaction type, had highly similar ELS-gene distance distributions (Additional file 3: Figure S1f). We tested this method using two gene sets, consisting of all genes or all protein-coding genes annotated by GENCODE V19, by evaluating precision and recall on the basis of each BENGI dataset. The use of protein-coding genes invariably resulted in better performance than the use of all genes (50% better on average over all 21 datasets across cell types; Additional file 2: Table S2d); thus, we used protein-coding genes for all subsequent analyses with this method.
The closest gene method worked best for crisprQTL pairs (precision = 0.67 and recall = 0.60), followed by ChIA-PET RNAPII pairs (precision = 0.66 and recall = 0.31 averaged across cell lines). The method performed worst for Hi-C pairs, with an average precision of 0.19 and an average recall of 0.12. These results are consistent with our statistical analyses described above, which revealed that crisprQTL and RNAPII ChIA-PET pairs were enriched in gene-proximal interactions, while Hi-C pairs tended to identify more distal interactions.
For comparison with other enhancer-gene prediction methods, we adapted the closest gene method to a quantitative ranking scheme where we ordered cCRE-gene pairs by the distance between the cCRE-ELS and the gene’s closest TSS. For each BENGI dataset, we evaluated the overall performance of the resulting distance method by calculating the area under the precision-recall curve (AUPR). Accordingly, the distance method exhibited the highest AUPR (0.41) for RNAPII ChIA-PET pairs and the lowest AUPR (0.06) for Hi-C pairs (Fig. 3a,b, Additional file 3: Figure S2b, Additional file 4: Table S3). Since the distance method is cell-type independent and does not require any experimental data, we considered it as the baseline method for comparing all enhancer-gene prediction methods.
Correlation-based approaches perform worse than the distance method
We next evaluated the performance of two correlation-based methods with the BENGI datasets: a method based on correlating the DNase signals at predicted enhancers with the DNase signals at TSSs across a panel of biosamples  and a method based on correlating DNase signals with gene expression . Both the DNase-DNase and DNase-expression methods outperformed random predictions for all 21 BENGI datasets, with average AUPR values of 0.10 and 0.12 vs. 0.07, respectively, but the differences were modest (Additional file 3: Figure S2; Additional file 4: Table S3). As previously demonstrated , positive pairs presented significantly higher correlations under both methods than negative pairs in all datasets (Additional file 3: Figure S2); however, the relative rankings of these correlations were mixed and did not completely segregate positive from negative pairs. The DNase-expression method significantly outperformed the DNase-DNase method for all but two BENGI datasets (Wilcoxon signed-rank test p = 6.7E−5), with an average AUPR increase of 29% (Additional file 2: Table S2).
We then evaluated the performance of the GeneHancer prediction model, via an integration of four types of enhancer annotations, including an earlier version of our cCREs, to generate a collection of candidate enhancers . These candidate enhancers were then linked to genes by integrating co-expression correlations, eQTLs, CHi-C data, and genomic distance. Because the authors used eQTLs and CHi-C from the same data sources as those in BENGI to build the GeneHancer model, we only evaluated the performance of the model on the ChIA-PET, Hi-C, and crisprQTL pairs. While the GeneHancer predictions were better than random predictions, the differences were extremely modest (average improvement of 0.01 in AUPR). The GeneHancer predictions also had a much lower overall recall than the correlations methods (on average 8% compared to 100% and 76% for DNase-DNase and DNase-expression respectively). Even for these limited sets of predictions, GeneHancer never outperformed the DNase-expression model and only outperformed the DNase-DNase model for crisprQTLs (Additional file 3: Figure S3).
Ultimately, the distance method substantially outperformed the two correlation-based methods and the GeneHancer predictions: distance was better than DNase-DNase for all 21 datasets (average AUPR increase of 127%; p = 1.9E−6; Additional file 2: Table S2), better than DNase-expression for 17 datasets (average AUPR increase of 77%; p = 1.6E−4), and better than GeneHancer predictions for all datasets (average AUPR increase of 256%; p = 9.5E−7). The PR curves of the distance method and the two correlation-based methods for the RNAPII ChIA-PET pairs are shown in Fig. 3a. For the first 25 k predictions, the distance method presented a similar precision to the DNase-DNase method and lower precision than the DNase-expression method, but when more predictions were made, the distance method substantially outperformed both correlation-based methods and achieved a much higher AUPR (0.41 vs. 0.28 and 0.26). We observed this crossover of PR curves in other non-QTL datasets as well (Additional file 3: Figure S2); thus, we integrated the distance and DNase-expression methods by averaging their ranks for the same prediction. Notably, this average-rank method showed high precision for its top-ranked predictions (Fig. 3a) and achieved higher AUPRs than the other methods for all 13 datasets except for GTEx eQTL pairs, with an average AUPR increase of 17% over the distance method for these datasets (Fig. 3b, Additional file 2: Table S2). For the eight GTEx eQTL datasets, the distance method remained the best approach, showing an 18% higher AUPR on average than the second-best method, average rank (Additional file 2: Table S2).
We asked why correlation-based methods performed poorly for predicting enhancer-gene pairs. One particular example is highlighted in Fig. 3 c, d. cCRE-ELS EH37E0853090 was paired with the AKIRIN2 gene by RNAPII ChIA-PET, Hi-C, CHi-C, and a GEUVADIS eQTL (Fig. 3c). However, this pair was poorly ranked by both correlation-based methods (correlation coefficients: r = 0.03 and 0.16 for DNase-DNase and DNase-expression, respectively). AKIRIN2 was highly expressed in most surveyed cell types (median normalized expression of 8.5 vs. background of 4.7 RPKM, Additional file 3: Figure S4a), and its promoter exhibited a high DNase signal (signal ≥ 50) for each of the DNase-seq groups (Additional file 3: Figure S4b). However, EH37E0853090 only presented high DNase signals in four cell types, which were all lymphoblastoid cell lines, suggesting that this enhancer was primarily active in the B cell lineage. The ubiquitous expression of AKIRIN2 and the cell-type-specific activity of EH37E0853091 resulted in a low correlation (Fig. 3d, Additional file 3: Figure S4b). In general, TSS-overlapping cCREs (cCREs-TSS) are active in many more biosamples than distal cCREs-ELS (median of 92 vs. 46 biosamples, p = 3.6E− 264, Additional file 3: Figure S4c-d). In summary, because the epigenomic signals at cCREs-ELS are far more cell type specific than the epigenomic signals at TSSs and gene expression profiles, correlation across biosamples is a poor method for detecting enhancer-gene pairs.
Supervised methods outperform baseline methods upon cross-validation
We tested two supervised machine-learning methods that were reported to perform well in the original publications on the methods: TargetFinder, which uses epigenomic signals such as histone mark ChIP-seq, TF ChIP-seq, DNase-seq in the corresponding cell types as input features, and PEP-motif, which uses the occurrence of TF sequence motifs as features. Xi et al. subsequently revealed that the original implementation of cross-validation (CV) by TargetFinder and PEP-motif allowed the assignment of enhancer-gene pairs from the same genomic loci to different CV groups, which led to sharing of training and testing data, overfitting of their models, and inflated performance . Thus, we implemented the chromCV method to ensure that pairs from the same chromosome were always assigned to the same CV group (Fig. 1e; “Methods”).
We first tested these two supervised methods on the six BENGI datasets in GM12878 because there were a large number of epigenomic datasets for this cell type that could be used as features to train the methods. Although PEP-motif performed better than random, it underperformed the distance method for all GM12878 pairs and was far worse than the average-rank method pairs (Fig. 4a, b; Additional file 2: Table S2b). In contrast, TargetFinder outperformed the average-rank method for all six datasets, with an average AUPR improvement of 66% (Fig. 4a, b; Additional file 2: Table S2), but the AUPRs were still low, especially for the Hi-C (0.17) and eQTL datasets (0.19 and 0.26).
Because the results of TargetFinder and PEP-motif upon our chromCV implementation were worse than the original published results for these methods, we also implemented a randomized 12-fold CV method as described in the original publications to test whether we could reproduce their results. Indeed, we observed large performance decreases for the chromCV method with respect to the original CV method (Fig. 4c), suggesting that overfitting was a source of inflated performance. PEP-motif presented a more substantial decrease in performance (average AUPR decrease of 80%) than TargetFinder (average AUPR decrease of 51%), likely because PEP-motif added 4 kb of padding on both sides of each enhancer, increasing the chance of overlapping training and testing data. Although PEP-motif and TargetFinder used Hi-C loops as the gold standard in their original analyses, both methods showed the largest performance decreases for the BENGI GM12878 Hi-C pairs (AUPR decrease of 95% for PEP-motif and 80% for TargetFinder). This analysis further highlights the utility of a carefully designed benchmark to prevent overfitting of supervised models.
Our implementation of TargetFinder in GM12878 cells involved 101 epigenomic datasets, including ChIP-seq data for 88 TFs, resulting in a total of 303 input features (Fig. 4d). However, such extensive TF ChIP-seq data were not available for other biosamples; thus, we also trained TargetFinder models using only distance and four epigenomic features: DNase, H3K4me3, H3K27ac, and CTCF data, which we refer to as the core4 TargetFinder models. While the core4 models exhibited an average AUPR reduction of 23% compared with the respective full models across the 13 BENGI datasets (Fig. 4a, b; Additional file 4: Table S3), they still outperformed the distance and average-rank methods for all datasets. Of particular note were the IMR-90 Hi-C pairs, which presented the greatest decrease in performance between the full and core4 TargetFinder models, with an AUPR reduction of 0.29 (81%). We observed similar large decreases in performance across all four variations of the IMR-90 Hi-C pairs. We also trained core3 models for the biosamples without CTCF data, and they showed an average AUPR reduction of 34% compared with the respective full models across the 13 BENGI datasets. For the seven GTEx eQTL datasets from tissues, these core3 models did not outperform the distance or average-rank models.
Overall, TargetFinder’s performance on the RNAPII and CTCF ChIA-PET pairs was markedly higher than its performance on other BENGI datasets. These datasets were the only two benchmarks of 3D chromatin interactions mediated by specific TFs. When we analyzed the feature-importance scores (i.e., Gini importance) from TargetFinder’s GBM model, we found that RNAPII and CTCF ChIP-seq signals at promoters had the highest importance in the respective models. To further dissect the features contributed to TargetFinder’s performance, we ran the algorithm on a subset of positive and negative pairs (1:2 ratio of positives to negatives) and three selections of positive and negative pairs that were matched for (i) only promoter inclusion, (ii) only distance, and (iii) promoter inclusion and distance (for promoter distance, see “Methods”). For all four subsets, the full TargetFinder still outperformed all other methods (Additional file 5: Table S4e); however, compared to the 1:2 ratio set (average AUPR = 0.86), performance was lower for the distance-matched and promoter-matched sets (average AUPR = 0.74 and 0.69) and was the lowest for the promoter-distance-matched sets (average AUPR = 0.61). We observed similar patterns with the TargetFinder core4 and core3 though the relative drop in performances was much larger—average decreases in AUPR of 0.25 for full model, 0.28 for core4 model, and 0.32 for core-3 model. Particularly, for the core3 CTCF ChIA-PET promoter-distance model, which does not include CTCF as a feature, we observed an AUPR of 0.43, a 0.30 reduction in AUPR compared to the 1:2 ratio pairs, and only a 0.03 improvement in AUPR over the DNase-DNase correlation method. These results suggest that differences in RNAPII/CTCF ChIP-seq signal and distance between positive and negative pairs contribute to TargetFinder’s ability to successfully predict cCRE-ELS-gene pairs.
TargetFinder exhibits moderate performance across different cell types
The most desirable application of a supervised method is to train the model in a biosample with 3D chromatin or genetic interaction data and then use the model to make predictions in another biosample without such data. Thus, we tested the TargetFinder core4 and core3 models for such application to the ChIA-PET, Hi-C, CHi-C, and GTEx eQTL datasets, readjusting our chromCV method to prevent overfitting  (see “Methods”).
As expected, the cross-cell-type models performed worse than the same-cell-type models, but their performance varied compared with the unsupervised distance and average-rank methods. For the CHi-C and RNAPII ChIA-PET datasets, all tested cross-cell-type TargetFinder models outperformed the distance and average-rank methods for both tested cell types (GM12878 vs. HeLa and GM12878 vs. CD34+), with average AUPR increases of 32% and 12%, respectively (Fig. 5a,b, Additional file 6: Table S5). For CTCF ChIA-PET, the core3 model trained on HeLa cells did not outperform the unsupervised methods for predicting GM12878 pairs (AUPR = 0.15 vs 0.21), but the models trained on GM12878 and the core4 model trained on HeLa did slightly outperform the unsupervised methods for predicting HeLa pairs and GM12878 pairs respectively (average AUPR increase of 7% Fig. 5c, Additional file 6: Table S5). The results for the Hi-C datasets were mixed. Among the 60 cross-cell-type models tested, 12 outperformed the distance and average-rank methods. Specifically, the model trained on GM12878 only outperformed the distance and average-rank methods for predicting HeLa or NHEK pairs (Fig. 5d, Additional file 6: Table S5), with an average 50% increase in performance. The model trained on IMR-90 never outperformed the distance and average-rank methods, and for the prediction of HMEC, IMR-90, and K562 pairs, none of the cross-cell-type models outperformed the distance or average-rank methods (Additional file 6: Table S5). These results were consistent across the fixed ratio pairs as well. Finally, none of the cross-cell-type models outperformed the distance method for the GTEx datasets; the distance method was the highest-performing model for all GTEx datasets (Additional file 6: Table S5).
Here, we have presented BENGI, a benchmark comprising cCRE-ELS-gene pairs, curated through the integration of the Registry of cCREs and genomic interaction datasets. We used BENGI to evaluate four published computational methods for target gene prediction that represent most of the widely used approaches in the field while surveying orthogonal dimensions: correlation methods survey across the biosample dimension, while supervised machine-learning methods such as TargetFinder survey across the assay dimension. We found that the two correlation-based, unsupervised methods significantly underperformed the baseline distance method, while one of the two supervised methods examined, TargetFinder, significantly outperformed the distance method when trained and tested within the same cell type by cross-validation. Although TargetFinder outperformed the distance method for all BENGI datasets, the AUPRs of the TargetFinder models were generally still low (0.07–0.72). In particular, TargetFinder performed best on ChIA-PET pairs; however, the performance substantially decreased when the positive and negative pairs were matched for their distributions of RNAPII/CTCF ChIP-seq signals at promoters and cCRE-ELS-gene distances. Thus, these features are the main contributors to TargetFinder’s higher performance on ChIA-PET datasets than other BENGI datasets. The other supervised method, PEP-motif, significantly underperformed the distance method, suggesting that the frequencies of TF motifs at enhancers and promoters are not sufficiently predictive of genomic interactions. When trained and tested in different cell types, TargetFinder performed better than the distance method for some BENGI datasets, albeit by a much smaller amount. Overall, there is much room for improvement for all of these methods, indicating that target gene prediction remains a challenging problem. BENGI datasets can be used by the community to tackle this problem while avoiding overfitting issues such as those identified for TargetFinder and PEP post publication [29, 30].
Our analyses highlight the differences between the genomic interactions identified by various experimental techniques (Table 1). For the same biosample (e.g., LCLs), the BENGI datasets generated by the same technique shared ~ 40% of their pairs (e.g., between RNAPII and CTCF ChIA-PET and between GEUVADIS and GTEx eQTLs), but the overlap between the datasets generated by different techniques were typically lower than 25% and could be as low as 1% (e.g., between eQTL and Hi-C). The BENGI datasets also differed significantly in terms of enhancer-gene distance and the enrichment of epigenomic signals at enhancers and TSSs. Thus, we still do not have a comprehensive understanding of the factors that regulate enhancer-gene interactions, and these different experimental techniques may capture different subsets of interactions.
Overall, all computational methods evaluated presented difficulty in predicting Hi-C pairs; even for the fixed ratio datasets, the Hi-C pairs consistently exhibited the lowest overall performance. This could be due to the technical challenges of calling Hi-C loops or the biological roles of these loops. For example, it has been noted that the detection of Hi-C loops requires care, and different loop-calling methods can produce markedly different results . Additionally, recent results from the Aiden lab demonstrated that gene expression did not change upon loop disruption via knocking out the key protein CTCF using a degron system . This finding may suggest that these CTCF Hi-C loops may play specific biological roles and may only represent a small subset of enhancer-gene interactions that have different properties compared to the other interactions.
Although the correlation-based methods did not outperform the distance method, the DNase-expression method did augment the distance method when combined with it. Furthermore, because correlation-based methods and supervised machine-learning methods survey orthogonal dimensions (biosample vs. assay), one promising future direction will be to combine these two types of approaches. For such future work to be fruitful, it will be beneficial to understand the differences in performance between the two correlation-based methods because the DNase-expression correlation method consistently outperformed the DNase-DNase correlation method. Several factors could contribute to this increased performance. First, gene expression may be a better readout for enhancer-gene interactions than a promoter’s chromatin accessibility, although these two features are correlated (average Pearson correlation r = 0.68). Second, for the DNase-expression method, Sheffield et al. generated normalized, batch-corrected matrices for the DNase-seq and gene expression data, while the DNase-DNase method used a read depth-normalized signal without any additional processing. To avoid imprecision in reimplementation, we downloaded these exact input datasets from the original publications (i.e., the exact normalized matrices for the DNase-expression method and the ENCODE2-processed DNase-seq bigWigs for the DNase-DNase method). The Sheffield et al. normalization technique may correct for outliers and batch effects, which would otherwise lead to spurious correlations impacting performance. Third, the DNase-DNase method merged 79 cell types into 32 groups based on cell type similarity. While this grouping may correct an uneven survey of the biosample space, it may lead to lower overall correlations for cell-type-specific interactions. We highlighted one such case involving the LCL-specific EH37E0853090-AKIRIN2 interaction, where the DNase-DNase method reported a correlation of 0.03, and the DNase-expression method reported a correlation of 0.12. The low correlation calculated by the DNase-DNase method was due to the combination of the four LCLs in one group, reducing the statistical power (Additional file 3: Figure S4b). These possible explanations should be carefully considered when designing future correlation-based and combined methods. Additionally, although these correlation-based methods did not perform well on the BENGI datasets, they may present better predictive power when used on curated sets of biosamples such as those obtained across embryonic development or cell differentiation. As we expand the number of cell types and tissues covered by BENGI, we hope to test these methods to evaluate their performance systematically.
Finally, we developed BENGI using an enhancer-centric model, as we were motivated by the Registry of cCREs. We hope to expand upon this approach to include a gene-centric model (i.e., for a given gene, determine the interacting enhancers) for future developments. Additionally, though BENGI datasets currently span 13 biosamples, the majority of the gene-ELS pairs derived from GM12878 or LCLs because these cells have been extensively profiled. Therefore, users of the benchmark should be cognizant that not all biosamples are profiled equally. Furthermore, the remaining BENGI datasets all derived from cell lines or heterogeneous tissues, none from primary cells. We will increase the representation of primary cells in our benchmark as soon as 3D chromatin and genetic interaction data on primary cells become available. We also plan to expand BENGI to include more functionally tested datasets such as the crisprQTLs as these results are published.
Precise and accurate identification of enhancer-gene links in a cell-type-specific manner remains a major challenge. Systematic comparisons using the BENGI datasets enabled us to identify the pitfalls in the current repertoire of computational methods, such as correlation-based approaches and the more complex, tree-based supervised algorithms. BENGI will aid the development of future enhancer-gene prediction models and improve our understanding of how regulatory elements control gene expression and ultimately the role that regulatory elements play in human diseases.
We downloaded the following ChIA-PET clusters generated by the Ruan lab  from the NCBI Gene Expression Omnibus (GEO) under accession number GSE72816.
We filtered each set of clusters by selecting ChIA-PET links that were supported by at least four reads (column 7 ≥ 4).
We downloaded the following Hi-C loops generated by the Aiden lab  from GEO under accession number GSE63525.
We downloaded crisprQTL data from Gasperini et al.  and mapped the reported genes to those annotated in GENCODE V19 and intersected the reported enhancer coordinates with cCREs-ELS in K562. A total of 4937 of the tested enhancers (85%) overlapped a K562 cCRE-ELS.
We used cCREs-ELS from V1 of the ENCODE Registry of cCREs available on the ENCODE portal found under the accessions provided in Additional file 1: Table S1a. We selected all cCREs-ELS (RGB color code 255,205,0) that were distal (i.e., greater than 2 kb from an annotated TSS, GENCODE v19).
Defining cCRE-gene pairs
We created cCRE-gene pairs using the Generate-Benchmark.sh. script, which is available on GitHub .
3D chromatin interactions (ChIA-PET, Hi-C, and CHi-C)
Using bedtools intersect (v2.27.1), we intersected the anchors of the filtered links (see above) with cCREs-ELS that were active in the same biosample. We retained all links with an anchor that overlapped at least one cCREs-ELS and with the other anchor within ± 2 kb of a GENCODE V19 TSS. We tagged all links with an anchor within ± 2 kb of the TSSs of multiple genes as ambiguous pairs and created a separate version of each dataset with these links removed.
Genetic interactions (eQTLs)
For eQTLs, we retrieved the location of each reported SNP from the eQTL file and intersected these loci with cCREs-ELS that were active in the same tissue type using bedtools intersect. We then paired the cCRE-ELS with the gene linked to the SNP. We only considered SNPs that were directly reported in each of the studies; we did not expand our set using linkage disequilibrium due to the mixed populations surveyed by GTEx.
For crisprQTLs, we intersected the reported positive enhancers with cCREs in K562 using bedtools intersect. We then paired the cCRE-ELS with the gene linked to the reported enhancer.
Generation of negative pairs
To generate negative pairs, we calculated the 95th percentile of the distances of positive cCRE-gene pairs for each dataset, with distance defined as the linear distance between the cCRE-ELS and the closest TSS of the gene using bedtools closest. For each cCRE-ELS among the positive cCRE-gene pairs that fell within this 95th percentile, we considered all other genes within the 95th percentile distance cutoff as negatives. Because our model is enhancer-centric, the same promoter may belong to both positive and negative sets, paired with different enhancers. For datasets with ambiguous links removed (ChIA-PET, Hi-C, and CHi-C), we also excluded genes in these ambiguous pairs as negatives. For the fixed ratio datasets, we also excluded genes that were in the positive pairs for the cCREs-ELS in other BENGI datasets before randomly selecting the negatives. If a cCRE-ELS exhibited fewer than four negative pairs, then it was excluded from this fixed ratio set.
Assignment of chromosome CV
For each BENGI dataset, we calculated the number of cCRE-gene pairs on each chromosome and assigned chromCV groups accordingly. The chromosome with the most pairs (often chr1) was assigned its own group. Then, we iteratively took the chromosome with the most and fewest pairs and combined them to create one CV group. In total, the 23 chromosomes (1–22, X) were assigned to 12 CV groups.
Characterization of BENGI datasets
Clustering of dataset overlap
For each pairwise combination of the GM12878/LCL BENGI datasets, we calculated the overlap coefficient of positive cCRE-gene pairs. Then, using hclust, we performed hierarchical clustering with default parameters.
For biosamples with matching RNA-seq data, we downloaded corresponding RNA-seq data from the ENCODE portal (accessions provided in Additional file 1: Table S1b, Additional file 3: Figure S1). For each gene, we calculated the average TPM between the two experimental replicates. To test whether there was a significant difference between BENGI datasets with or without ambiguous pairs, we used a Wilcoxon test.
For cCREs-ELS in each positive pair across the GM12878 and LCL BENGI datasets, we calculated the average ChIP-seq signal for 140 transcription factors and DNA-binding proteins. We downloaded the ChIP-seq signal from the ENCODE portal (accession available in Additional file 2: Table S2b) and used UCSC’s bigWigAverageOverBed to calculate the average signal across each cCRE. For each BENGI dataset, we then reported the average signal for all cCREs.
Implementation of cCRE-gene prediction methods
We identified the closest TSS to each cCRE-ELS using bedtools closest and GENCODE V19 TSS annotations. We compared two options: use of the full set of GENCODE TSSs (with problematic annotations removed) or use of only protein-coding GENCODE TSSs. To evaluate performance, we calculated the overall precision and recall for each BENGI dataset (Script: Closest-Gene-Method.sh).
For each cCRE-gene pair, we calculated the linear distance between the cCRE-ELS and the gene’s nearest TSS. To rank these pairs, we took the inverse (1/distance) and calculated the area under the precision-recall curve (AUPR) using a custom R script that uses the PROCR library (Script: Run-Distance-Method.sh).
DNase-DNase correlation method
We used the same DNase-seq datasets as Thurman et al. employed for their DNase-DNase method. We downloaded these legacy datasets generated during ENCODE Phase 2 from the UCSC genome browser. For each cCRE-gene pair, we curated a set of cCREs-TSS by determining the closest cCRE for each TSS of the gene. We then calculated the average DNase signal across the nucleotide positions in the cCRE-ELS and cCRE-TSS for each DNase dataset. For similar cell types, as determined by Thurman et al., we averaged the DNase signal among these similar cell types in each of the 32 groups to generate 32 values for each cCRE-ELS and cCRE-TSS. We then calculated the Pearson correlation coefficient (PCC) for each cCRE-ELS and cCRE-TSS pair. If a gene was annotated with multiple TSSs, we selected the highest PCC among all the cCRE-ELS and cCRE-TSS comparisons. We ranked the predictions by their PCC and calculated the AUPR using the PROCR library (Script: Run-Thurman.sh).
DNase-expression correlation method
To match the legacy data and normalization methods originally used by previous investigators , we downloaded normalized counts across 112 cell types for DNase-hypersensitive sites or DHSs (dhs112_v3.bed) and genes (exp112.bed) from http://big.databio.org/papers/RED/supplement/. We intersected each cCRE-ELS with the DHSs previously curated . If a cCRE overlapped with more than one DHS, we selected the DHS with the strongest signal for the cell type in question (i.e., the DHS with the strongest signal in GM12878 for GM12878 cCREs-ELS). For each cCRE-gene pair, we then calculated the Pearson correlation coefficient using the 112 normalized values provided in each matrix. cCRE-gene pairs that did not overlap with a DHS or did not have a matching gene in the expression matrix were assigned a score of − 100. (Script: Run-Sheffield.sh).
We reimplemented PEP-motif to run on our cCRE-gene pairs with chromCV. Similar to Yang et al., we calculated motif frequency using FIMO  and the HOCOMOCO database (v11 core, ). We also added ± 4 kb of padding to each cCRE-ELS as originally described. We concatenated cross-validation predictions and calculated AUPR values using PROCR (Script: Run-PEPMotif.sh).
We reimplemented TargetFinder to run on our cCRE-gene pairs with chromCV. For features, we used the identical datasets described by Whalen et al. for each cell type. We concatenated the cross-validation predictions and calculated AUPR values using PROCR (Script: Run-TargetFinder-Full.sh).
To dissect features contributing to TargetFinder’s high performance on ChIA-PET pairs, we created four subsets of pairs for the GM12878 RNAPII and CTCF ChIA-PET datasets.
A subset with a 1:2 ratio of positives to negatives which was created by subsampling 1 positive link for each cCREs and 2 negative links for each cCRE. This was analogous to the 1:4 fixed ratio method described above.
A “promoter-matched” subset that only includes pairs from promoters that are in at least one positive and one negative pair. We then subsample to achieve a fixed 1:2 ratio of positives to negatives.
A “distance-matched subset for which we define 5 distance quantiles based on the distribution of positive pairs and sample equally from each bin maintaining a 1:2 ratio of positives to negatives.
A “promoter-distance-matched” subset for which we match for promoter use as described in (2) and distance as described in (3). Once again, we maintained a 1:2 ratio of positives to negatives.
To test the cross-cell-type performance of TargetFinder, we generated core4 and core3 models for each cell type and then evaluated the models in other cell types. To prevent any overfitting, we assigned the chromCV of the test sets to match those of the training sets.
Availability of data and materials
All processed datasets and corresponding scripts are available and version controlled at https://github.com/weng-lab/BENGI . cCREs are available at the ENCODE portal (www.encodeproject.org) under the following accessions: ENCSR480YCS, ENCSR451UAK, ENCSR502KJC, ENCSR981LWT, ENCSR593JKE, ENCSR376OQT, ENCSR970YPW, ENCSR282MAC, ENCSR551OAQ, ENCSR529CSY, ENCSR117DKP, ENCSR117AJU, and ENCSR935NWB. ChIA-PET data is available at GEO under accession GSE72816. Hi-C data is available at GEO under accession GSE63525. CHi-C data is available from ArrayExpress under accession E-MTAB-2323. Cis-eQTLs from the GEUVADIS project are available at ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/GEUV/E-GEUV-1/analysis_results/EUR373.gene.cis.FDR5.all.rs137.txt. GTEx single-tissue cis-eQTLs are available in GTEx_Analysis_v7_eQTL.tar.gz from https://gtexportal.org/home/datasets. crisprQTL data is available from the supplemental tables in Gasperini et al. . ENCODE RNA-seq experiments are available at the ENCODE portal under the following accessions: ENCSR000COQ, ENCSR000CPR, ENCSR000AEM, ENCSR000CTQ, ENCSR000CPL, ENCSR830HIN, ENCSR860HAA.
Ernst J, Kheradpour P, Mikkelsen TS, Shoresh N, Ward LD, Epstein CB, et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature. 2011;473:43–9.
Rajagopal N, Xie W, Li Y, Wagner U, Wang W, Stamatoyannopoulos J, et al. RFECS: A Random-Forest Based Algorithm for Enhancer Identification from Chromatin State. Singh M, editor. PLoS Comput Biol Public Libr Sci. 2013;9:e1002968.
He Y, Gorkin DU, Dickel DE, Nery JR, Castanon RG, Lee AY, et al. Improved regulatory element prediction based on tissue-specific local epigenomic signatures. Proc Natl Acad Sci U S A. 2017;114:E1633–40.
Finucane HK, Bulik-Sullivan B, Gusev A, Trynka G, Reshef Y, Loh P-R, et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nature Genet. 2015;47:1228–35.
Lettice LA, Heaney SJH, Purdie LA, Li L, de Beer P, Oostra BA, et al. A long-range Shh enhancer regulates expression in the developing limb and fin and is associated with preaxial polydactyly. Hum Mol Genet. 2003;12:1725–35.
Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Sci Am Assoc Adv Sci. 2009;326:289–93.
GTEx Consortium, Laboratory, Data Analysis &Coordinating Center (LDACC)—Analysis Working Group, Statistical Methods groups—Analysis Working Group, Enhancing GTEx (eGTEx) groups, NIH Common Fund, NIH/NCI, et al. Genetic effects on gene expression across human tissues. Nature. 2017;550:204–13.
Sheffield NC, Thurman RE, Song L, Safi A, Stamatoyannopoulos JA, Lenhard B, et al. Patterns of regulatory activity across diverse human cell types predict tissue identity, transcription factor binding, and long-range interactions. Genome Res. 2013;23:777–88.
Moore JE, Pratt HE, Purcaro MJ, Weng Z. A curated benchmark of enhancer-gene interactions for evaluating enhancer-target gene prediction methods. Github. 2019. https://github.com/weng-lab/BENGI. Accessed 1 Dec 2019.
Grant CE, Bailey TL, Noble WS. FIMO: scanning for occurrences of a given motif. Bioinformatics. 2011;27:1017–8.
Kulakovskiy IV, Vorontsov IE, Yevshin IS, Sharipov RN, Fedorova AD, Rumynskiy EI, et al. HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-Seq analysis. Nucleic Acids Res. 2018;46:D252–9.
This project was developed and designed by JEM and ZW. JEM performed the analyses. JEM, HEP, and MJP jointly contributed to the creation of the Registry of cCREs. All authors read and approved the final version of the manuscript.
Expression levels of genes in BENGI pairs. Figure S2. PR curves for unsupervised models. Figure S3. Correlation between BENGI pairs. Figure S4. Correlation methods perform poorly due to the ubiquity of promoters. Figure S5. PR curves of the supervised methods evaluated with BENGI datasets.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.