Creation and ranking of genome-wide enhancer-to-target gene definitions (EnTDefs)
Several approaches for defining human enhancer locations and their target genes have been proposed in the literature, but no systematic study has evaluated their performance, separately or in combination, on a genome-wide scale. To determine the best sets of human enhancers and their distal gene targets, we generated a total of 1860 genome-wide Enhancer-Target gene Definitions (EnTDefs) using existing experiments and/or literature-derived data, and systematically evaluated their performance. This was done by applying all possible combinations of methods for defining (1) enhancer region locations, identified from four data sources (ChromHMM [43], DNase-seq [38], FANTOM5 [37, 44, 45], and Thurman [38]), and (2) enhancer-target gene links, defined by four different methods (ChIA-PET data [“ChIA”] [46, 47], DNase-signal correlation [“Thurman”] [38], gene expression correlation [“FANTOM5”] [45], and loop boundaries with convergent CTCF motif [“L”] [48]), including combinations of multiple sources and methods (see “14” for details). Overall, these included a total of 1,768,201 possible individual enhancer-target links across >500 cell types by integrating all of the 4 enhancer-defining datasets and all of the 4 enhancer-gene link datasets. These enhancer-target links were defined from 685,921 enhancers and 21,094 linked target genes. Figure 1 demonstrates the workflow for the creation and evaluation of these 1860 EnTDefs. For the “L” enhancer-gene linking method, we evaluated the loops with up to 3 genes (L1: one gene, L2: ≤ two genes, or L3: ≤ three genes), allowing links between the enhancer and any of the included genes within the loop. Because current knowledge of enhancers is far from complete and the experimental data that link enhancers to target genes are limited, the genome coverage of EnTDefs defined by the experimentally and/or computationally derived methods (Fig. 1A: four enhancer-defining methods and four enhancer-target gene linking methods) was expected to be low. Therefore, we extended the enhancer regions by up to 1 kb and/or assigned regions outside of enhancers and promoters (within 5 kb of a transcription start site (TSS)) to the gene with the nearest TSS (Fig. 1A: Extension and Additional link), resulting in 100% coverage of distal genomic regions (>5 kb from a TSS). All of the 1860 EnTDefs were evaluated and ranked based on how well they performed in gene set enrichment (GSE) testing using distal ChIP-seq peaks. Specifically, the Gene Ontology biological process (GO BP) enrichment results from 87 ENCODE ChIP-seq datasets for 34 distinct transcription factors (TFs) were compared with the curated GO BP terms annotated to the same tested TFs (GO annotation by GO database) using F1 scores (see “14”). EnTDefs demonstrating higher concordance ranked higher, as they were better able to identify the known functions of the TFs based on their distal binding regions (non-promoters).
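As an illustration of the combinatorial design, the short Python sketch below enumerates one scheme that reproduces the total of 1860. The factorization is an assumption made for illustration and is not stated in this section: any non-empty subset of the four enhancer sources (15 options), any non-empty set of link methods in which at most one loop variant (L1, L2, or L3) is used (31 options), an optional 1 kb extension (2 options), and an optional nearest-gene addition (2 options). All identifiers are hypothetical.

```python
from itertools import combinations, product

ENHANCER_SOURCES = ["ChromHMM", "DNase-seq", "FANTOM5", "Thurman"]
LINK_METHODS = ["ChIA", "Thurman", "FANTOM5"]
LOOP_OPTIONS = [None, "L1", "L2", "L3"]   # assumption: at most one loop variant
EXTENSIONS_BP = [0, 1000]                 # assumption: no extension or 1 kb
NEAREST_ALL = [False, True]               # assumption: optional nearest-gene addition


def nonempty_subsets(items):
    """Return every non-empty subset of items as a frozenset."""
    return [frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)]


def link_method_sets():
    """Combine the three non-loop link methods with an optional loop variant."""
    sets = []
    for r in range(len(LINK_METHODS) + 1):
        for chosen in combinations(LINK_METHODS, r):
            for loop in LOOP_OPTIONS:
                methods = frozenset(chosen) | ({loop} if loop else frozenset())
                if methods:               # skip the empty combination
                    sets.append(methods)
    return sets


def enumerate_entdefs():
    """Cartesian product of all design choices under the assumptions above."""
    return list(product(nonempty_subsets(ENHANCER_SOURCES),
                        link_method_sets(),
                        EXTENSIONS_BP,
                        NEAREST_ALL))


if __name__ == "__main__":
    print(len(enumerate_entdefs()))
```

Under these assumptions the count is 15 × 31 × 2 × 2 = 1860, which matches the number of EnTDefs reported here; the actual construction is described in the Methods.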
Overview of the EnTDef characteristics
We first investigated the characteristics of the 1860 EnTDefs by comparing them to simply assigning distal genomic regions (i.e., >5 kb from a TSS) to the genes with the nearest TSS (>5 kb Locus Definition [LocDef]) (Fig. 2A, Additional file 1: Fig. S1). The EnTDefs were ranked in decreasing order by their average F1 score across 34 TFs, and the top 741 EnTDefs (~40%) were found to significantly outperform the >5 kb LocDef (Wilcoxon signed-rank test, FDR < 0.05). The best-performing EnTDef (No. 1 ranked) was defined by DNase-seq plus FANTOM5 enhancers and ChIA, Thurman, and FANTOM5 enhancer-target gene link methods with the “nearest_All” addition. For the top 741 EnTDefs, the percentage of genome covered and percent of distal peaks caught (outside of 5 kb regions around TSSs) was as high as 100% (89–100%), the median number of genes assigned to each enhancer was 2 (range of 1–2), and the median number of enhancers assigned to each gene was 20 (range of 2–98). Out of the 741 EnTDefs, those ranked 2 through 19 were not significantly worse than the best-performing EnTDef (Wilcoxon signed-rank test, p > 0.01. Additional file 2: Table S1), suggesting that these 19 EnTDefs performed equally well. This finding was robust to the specific set of GO biological process (GOBP) annotations used (i.e., with or without IEA-based GO to gene annotations; see “14,” data not shown).
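The ranking criterion can be illustrated with a minimal Python sketch. It assumes that, for each TF, the set of significantly enriched GO BP terms is compared with the set of terms curated for that TF, and that EnTDefs are ordered by their mean F1 score across TFs. The function names and input layout are hypothetical, and details such as the significance cutoff and the handling of multiple datasets per TF are not modeled.

```python
def f1_score(predicted, truth):
    """F1 between a set of enriched GO terms and a set of curated GO terms."""
    predicted, truth = set(predicted), set(truth)
    true_pos = len(predicted & truth)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(truth)
    return 2 * precision * recall / (precision + recall)


def rank_definitions(results, annotations):
    """Rank definitions by mean F1 across TFs, best first.

    results:     {definition_name: {tf: set of significant GO terms}}
    annotations: {tf: set of curated GO terms}
    """
    mean_f1 = {}
    for name, per_tf in results.items():
        scores = [f1_score(per_tf.get(tf, ()), terms)
                  for tf, terms in annotations.items()]
        mean_f1[name] = sum(scores) / len(scores)
    return sorted(mean_f1.items(), key=lambda item: item[1], reverse=True)
```

For example, a definition that recovers one of two curated terms while also reporting one unrelated term has precision 0.5, recall 0.5, and F1 = 0.5.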
To assess the relative benefit of each method used for enhancer definition, enhancer extension, and enhancer-target gene link assignments, we compared the F1 scores of each of the top 741 EnTDefs containing a particular method to the F1 scores of the corresponding EnTDefs excluding only that method (keeping everything else the same). Using paired Wilcoxon tests, we calculated the percent of EnTDefs for each tested method that showed significantly increased F1 scores across the ChIP-seq datasets (Fig. 2B). Adding FANTOM5 significantly improved the performance of >50% of EnTDefs, whereas adding the ChromHMM method only enhanced the performance for ~5%. DNase-seq and Thurman methods ranked in the middle, improving ~24% and ~16% of EnTDefs, respectively. Omitting enhancer region extension significantly increased the F1 scores for ~60% of EnTDefs, whereas the 1 kb extension improved only ~7% of EnTDefs. It is not surprising that all of the top 741 EnTDefs included the “nearest_all” addition, since this addition significantly increased genome coverage by assigning all regions outside enhancers and promoters to the closest gene (>5 kb LocDef), leading to improved sensitivity and thus F1 score (Fig. 2A). On the other hand, the fact that these 741 EnTDefs outperformed the >5 kb LocDef suggests that the “smart” enhancer to target gene assignments more accurately capture real biological regulatory elements for distal enhancer regions as compared to the simplistic assignment to nearest genes. FANTOM5 and ChIA enhancer-gene assignment methods significantly improved ~70% of EnTDefs, while the L and Thurman methods only improved ~1.7% and ~9% of EnTDefs, respectively. Including more than one gene in the CTCF ChIA-PET loops (L2/L3 methods) failed to improve the performance.
In addition, 70% of the 741 EnTDefs were generated using combinations of at least two different methods for enhancer definitions (ChromHMM, DNase-seq, FANTOM5, and/or Thurman) and 100% of them contained at least two enhancer-gene assignment methods (ChIA, FANTOM5, L, and/or Thurman), illustrating the importance of the integration of multiple data sources and methods to improve the performance of enhancer to target gene assignments.
EnTDefs plus promoter regions outperform the nearest TSS method
Our analyses thus far have focused on the assessment of distal gene regulation. However, often the goal is to assess the functional regulation from anywhere in the genome, including binding both distal and proximal to TSSs. One commonly used method for ChIP-seq GSE testing is to link all peaks to the gene(s) with the nearest TSS, hereinafter referred to as the “nearest TSS” method (Additional file 1: Fig S1 “nearest TSS” LocDef), resulting in all peaks having at least one assigned gene. EnTDefs were generated for distal regions (outside the 5-kb windows around TSSs) and any regions within 5 kb of a TSS were ignored, whereas the “nearest TSS” method includes all genomic regions. Thus, to compare fairly with the “nearest TSS” method, we added promoter regions to the top 10 ranked EnTDefs, referred to as “EnTDef_plus5kb.” That is, peaks within 5 kb of a TSS were assigned to the nearest gene (Additional file 1: Fig. S1 “5 kb” LocDef), while distal peaks were assigned according to the EnTDef. All ten of the EnTDef_plus5kbs significantly outperformed the “nearest TSS” method (~0.05 increase in average F1 score, Wilcoxon signed-rank test, p < 0.0001) (Fig. 2C), using the same evaluation method based on F1 scores as used above (see “14”). Since the only difference between these is how distal binding events were defined, the improved performance of our EnTDef_plus5kbs is directly attributable to the distal enhancer-target gene links, and this comparison demonstrates that these distal links provide regulatory information beyond that provided by promoters and nearest genes.
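The EnTDef_plus5kb assignment rule can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the study: each peak is represented by a single position (for example its midpoint), TSSs are given per chromosome as a sorted list, and an EnTDef is given as a list of half-open enhancer intervals with their target genes. All names are hypothetical.

```python
from bisect import bisect_left


def nearest_tss(chrom, pos, tss_by_chrom):
    """Return the (tss_position, gene) entry closest to pos.

    tss_by_chrom: {chrom: non-empty sorted list of (tss_position, gene)}
    """
    sites = tss_by_chrom[chrom]
    i = bisect_left(sites, (pos,))
    candidates = sites[max(i - 1, 0): i + 1]   # neighbours on both sides
    return min(candidates, key=lambda site: abs(site[0] - pos))


def assign_peak(chrom, pos, tss_by_chrom, entdef, promoter_window=5000):
    """Assign one peak position to genes.

    Peaks within promoter_window of a TSS go to the nearest gene;
    distal peaks go to the target genes of any overlapping enhancer.
    entdef: {chrom: list of (start, end, set_of_target_genes)}
    """
    tss, gene = nearest_tss(chrom, pos, tss_by_chrom)
    if abs(tss - pos) <= promoter_window:
        return {gene}
    genes = set()
    for start, end, targets in entdef.get(chrom, []):
        if start <= pos < end:
            genes.update(targets)
    return genes
```

A distal peak that overlaps no enhancer returns an empty set in this sketch; with the “nearest_all” addition the EnTDef itself covers all distal regions, so that case would not arise.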
We next determined whether our “smart” EnTDefs using only distal binding events could even outperform the use of all peaks (promoter and enhancer) with naïve assignments to the genes with the nearest TSS. When compared with the “nearest TSS” method, the top 10 best-performing EnTDefs showed slightly lower F1 scores (~0.03 lower), but the top half of them were not significantly different from “nearest TSS” (Wilcoxon signed-rank test, p > 0.05). Thus, although they did not outperform it, the best were not significantly worse. This illustrates the great importance of regulation from promoters in GSE testing.
Two other commonly used GSE methods for genomic regions, GREAT [39] and Fisher’s exact test (FET) using peaks within 5 kb of a TSS (Additional file 1: Fig. S1 “5kb” LocDef), were also evaluated using the same scheme. Notably, the three GSE testing methods (Poly-Enrich, GREAT and FET using 5 kb LocDef to assign peak to gene) performed equally well (Friedman test, p = 0.91), but significantly worse than the top 10 EnTDefs (distal regions only) (Fig. 2C, average F1 = 0.45 vs 0.47, Wilcoxon rank-sum test, p < 0.007). In addition, both the top 10 EnTDefs and 5 kb LocDef (i.e., assigning promoters to the nearest gene) significantly outperformed the >5 kb LocDef (i.e., the naïve approach of assigning distal regions to the nearest gene) (average F1 = 0.47, 0.45, vs 0.27, Wilcoxon signed-rank test, p = 2.37 × 10−14 and 1.32 × 10−8 respectively). In summary, although the naïve approach of linking distal regions to the nearest gene (>5 kb LocDef) did not outperform the use of promoter data only (5 kb LocDef), the use of distal binding events with “smart” gene assignments (EnTDefs) did outperform the use of promoter data only. Incorporation of these 5-kb promoter regions (5 kb LocDef) into the top 10 EnTDefs (“EnTDefs_plus5kb”) significantly further improves their performance (better than “nearest_TSS” approach), indicating both promoter and distal regions provide non-overlapping, independent evidence for regulatory programs. These findings illustrate the importance of accurately modeling regulation from enhancers and that when done well, enhancers have the potential to provide more regulatory information than promoters. We conclude that GSE testing using our top EnTDefs exceeds the commonly used nearest distance-based and promoter-only-based GSE approaches.
Our EnTDefs are generalizable to different cell lines
Next, we sought to investigate whether the EnTDefs (which were selected based on their performance in GM12878, H1-HESC and K562 cell lines) can perform equally well when testing ChIP-seq data from different cell lines (A549, HEPG2, HUVEC, and NB4). Surprisingly, the average F1 score in test ChIP-seq datasets (different cell lines) was significantly higher than that from the evaluation ChIP-seq datasets (original cell lines; average F1 = 0.59 vs. 0.50, Wilcoxon rank-sum test, p = 0.00098) (Fig. 2D, and Additional file 1: Fig. S2). This may be due to the test ChIP-seq datasets containing more peaks than the evaluation datasets (Additional file 1: Fig. S3A, Wilcoxon rank-sum test, p = 0.092), and indeed we found that the F1 scores were significantly correlated with the number of peaks (Additional file 1: Fig. S3B, Pearson’s correlation r = 0.65, p = 4.57 × 10−6). After correcting for the number of peaks, the association between the F1 score and dataset type was decreased, although test dataset F1 scores still remained higher than the evaluation F1 scores (p < 0.05; linear model with log10 peak number as covariate). Furthermore, the average F1 scores of the top 10 EnTDefs and nearest_TSS in the evaluation dataset were strongly correlated with those in the test dataset (r = 0.94, p = 1.23 × 10−5), illustrating that an important variable in determining F1 score is the TF and/or antibody. The findings indicate that the performance of the top selected EnTDefs is independent of the cell types of the ChIP-seq datasets, but is likely strongly influenced by the quality of the datasets themselves (e.g., the specificity and efficiency of an antibody, ChIP quality, number of peaks).
We reasoned that, because the EnTDefs were created from combinations of diverse data sources stemming from >500 different cell types, they form a consensus set of enhancer and gene assignments across various cell types and are therefore representative of the background interactions between enhancers and target genes across many cell types. The high generalizability of our top EnTDef makes it feasible to integrate with GSE testing in a cell-type-independent manner.
In addition, we applied our EnTDefs on a completely independent set of ChIP-seq experiments in GSE testing and evaluated their performance using a different metric. That is, we used data that are both from completely different, non-overlapping transcription factors (TFs) and completely different, non-overlapping cell types. The new datasets include 31 ENCODE ChIP-seq experiments of 9 cell lines and 14 transcription factors (TFs) (Additional file 2: Table S2), which all passed quality controls according to the Cistrome project (http://cistrome.org/db/#/about). The receiver operating characteristic (ROC) and precision-recall (PR) curves were generated for each ChIP-seq dataset when comparing the significant GOBP terms with the assigned ones for the tested TF (see “16”) at a series of GSE p-value cutoffs, and the area under PRC (AUPRC) and area under ROC (AUROC) were calculated. As compared to the baseline methods (nearest TSS and >5 kb), the top10 EnTDefs with or without plus 5 kb locus definition had both higher overall AUPRC and AUROC across the 31 ChIP-seq datasets (Additional file 1: Fig. S3C). This provides independent evidence of the outperformance of our EnTDef compared to the commonly used “nearest TSS” method, illustrating the robustness of our top EnTDefs across a broad range of datasets.
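For illustration, the two summary metrics can be computed from a list of GSE p-values and binary labels (1 if the GO term is annotated to the tested TF, 0 otherwise) as in the Python sketch below. It is a minimal sketch: every distinct p-value is used as a cutoff, AUROC is integrated with the trapezoid rule, and AUPRC is computed as a step-wise sum (average-precision style). These choices and all names are assumptions; the exact cutoff series and integration rule used in the study are not given in this section.

```python
def roc_pr_auc(pvalues, labels):
    """Return (AUROC, AUPRC) treating smaller p-values as stronger calls.

    pvalues: GSE p-value for each GO term
    labels:  1 if the term is annotated to the TF, else 0
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both annotated and non-annotated terms are required")

    pairs = sorted(zip(pvalues, labels))
    roc = [(0.0, 0.0)]          # (false positive rate, true positive rate)
    pr = []                     # (recall, precision)
    tp = fp = 0
    i = 0
    while i < len(pairs):
        cutoff = pairs[i][0]
        # Call significant every term whose p-value is <= the current cutoff.
        while i < len(pairs) and pairs[i][0] == cutoff:
            tp += pairs[i][1]
            fp += 1 - pairs[i][1]
            i += 1
        roc.append((fp / n_neg, tp / n_pos))
        pr.append((tp / n_pos, tp / (tp + fp)))

    auroc = sum((x1 - x0) * (y0 + y1) / 2
                for (x0, y0), (x1, y1) in zip(roc, roc[1:]))
    auprc, prev_recall = 0.0, 0.0
    for recall, precision in pr:
        auprc += (recall - prev_recall) * precision
        prev_recall = recall
    return auroc, auprc
```

In a toy case with p-values 0.001, 0.01, 0.2, and 0.5 where the first and third terms are annotated, three of the four annotated/non-annotated pairs are ordered correctly, giving an AUROC of 0.75.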
General EnTDefs perform comparably to cell-type-specific EnTDefs
To contrast with the EnTDefs generated by integrating data for many cell types, hereafter called “general EnTDefs,” we created 420 “cell-type-specific EnTDefs” (CT-EnTDefs) for each of the four cell types (GM12878, H1hESC, K562, and MCF7) using ChIA-PET datasets of a particular cell type, and ranked the CT-EnTDefs by average F1 scores of the evaluation ChIP-seq datasets from the same cell type (see “14”). Since many enhancers and regulatory relationships between enhancers and target genes are considered to be tissue- and cell-type-specific, we sought to examine how the general EnTDefs perform when compared with CT-EnTDefs. For each tested TF (the average number of TFs tested in each cell type is ~55, ranging from 4 to 96, see Additional file 2: Table S3), the average F1 scores were calculated across the top 10 CT-EnTDefs of each cell type, or across the corresponding general EnTDefs with the same combinations of enhancer definition and enhancer-gene link methods. To prevent bias due to the incorporated cell-type-specific enhancer-gene pairs in the general EnTDefs, the ChIA-PET datasets of the particular cell type were excluded from the comparative general EnTDefs (see “14,” Fig. 3C). Three types of comparisons were performed for each TF of a particular cell type: (i) general EnTDef vs. CT-EnTDef using the same cell type (same-CT-EnTDefs), (ii) general EnTDef vs. CT-EnTDef using a different cell type (diff-CT-EnTDefs), and (iii) same-CT-EnTDefs vs. diff-CT-EnTDefs. Notably, there was no significant difference in the average F1 scores among the three comparative EnTDefs (i.e., same-CT-EnTDefs, diff-CT-EnTDefs, and general EnTDefs) for any cell type (Fig. 3A, Wilcoxon rank-sum test, p ≥ 0.2; three groups: Kruskal-Wallis test, p ≥ 0.3).
We also observed that the average F1 scores were significantly correlated between the same-CT-EnTDef and diff-CT-EnTDef for all four cell types with Pearson’s correlation (in GM12878, H1hESC and MCF7, R ≥ 0.9, while in K562, R = 0.76) (Fig. 3B and Additional file 1: Fig. S4, p < 0.0001), consistent with our finding above that TF or antibody used for ChIP-seq explains a high degree of variation in F1 scores. This correlation trend still held when looking across individual TFs and EnTDefs rather than averages (i.e., F1 score per TF per EnTDef, Additional file 1: Fig. S5).
As shown in Fig. 3C, regardless of the type of EnTDef (general EnTDefs, same-CT-EnTDefs, and diff-CT-EnTDefs) used for evaluation, the average F1 scores across TFs and EnTDefs were similar, with the difference ranging from 0 to 0.13. Taken together, these findings suggest that CT-EnTDefs are overall comparable to general EnTDefs, and the benefit of using CT-EnTDefs is minor and depends on the quality and quantity of data for a particular cell type (e.g., K562 vs. others, Fig. 3B). This is good news since it is costly and difficult to generate cell-type-specific ChIA-PET experiments, which are required to create the corresponding CT-EnTDef. In contrast, the general EnTDefs, which capture real enhancer and target gene interactions in a similar way to CT-EnTDefs, are more practically and economically favorable for GSE testing.
Independent validation of our EnTDef ranking approach
We sought to further evaluate our EnTDef ranking using a curated benchmark of enhancer-gene interactions (BENGI), which includes both true-positive and true-negative pairs [49]. By overlapping the enhancer-gene pairs of our top 10, middle 10 (ranked at 732–741), and bottom 10 EnTDefs, as well as the top 10 EnTDefs with the 5 kb locus definition (“EnTDef.top_plus5kb”) and the baseline methods (nearest TSS and >5 kb locus definitions), with BENGI, we calculated the F1 score, sensitivity, specificity, and precision (see “14”). The “nearest TSS” method is directly comparable to “EnTDef.top_plus5kb,” while the “5kb_outside” method is directly comparable to “EnTDef.top.” Consistent with our ranking approach, the top 10 EnTDefs showed the highest average F1 scores, with the values decreasing for the middle and bottom 10 EnTDefs sequentially (Fig. 4A). The average F1 scores of the top 10 EnTDefs were significantly higher than “nearest TSS” (EnTDef vs nearest_TSS ANOVA test in BENGI with fixed positive/negative ratio [1:4]: p-value = 2.74 × 10−104, in BENGI with natural positive/negative ratio [much more negative than positive pairs]: p-value = 5.48 × 10−3), although the extent of the increase became smaller in the BENGI natural ratio dataset. The ranks of our top/middle/bottom EnTDefs and baseline methods based on the BENGI-derived F1 scores vs. those based on our original GSE testing-derived F1 scores were highly correlated (Fig. 4B), indicating general concordance between the two main benchmarks used. One difference, however, is that BENGI consistently ranked methods without the 5-kb promoter regions (EnTDef.top and 5kb_outside) higher than the ones with those regions (EnTDef.top_plus5kb and nearest_tss), whereas the GSE benchmark did the opposite.
To examine what drives the differences in F1 scores, we assessed sensitivity and specificity separately. The overall sensitivity of the top 1 EnTDef showed a 91% increase compared to that of the “nearest TSS” method (0.61 vs 0.32), while the specificity and precision decreased by ~20% and ~14%, respectively. Notably, the average sensitivity of the top 10 EnTDefs and EnTDefs_plus5kb was significantly increased as compared to “nearest TSS” and “>5 kb” in all BENGI subsets, while the specificity showed a small decrease (Additional file 1: Fig. S6A). The same trend can be observed in individual EnTDefs plotted as sensitivity versus (1-specificity) (Additional file 1: Fig. S6B).
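The four benchmark metrics follow from the standard confusion-matrix definitions, as in the Python sketch below. As a simplifying assumption, enhancer-gene pairs are matched here by exact identifier, whereas the study overlaps genomic regions (see the Methods); all names are hypothetical.

```python
def benchmark_metrics(predicted_pairs, positive_pairs, negative_pairs):
    """Sensitivity, specificity, precision, and F1 against a labeled benchmark.

    Each argument is a collection of (enhancer_id, gene) pairs; pairs are
    matched by exact identifier, which is a simplification.
    """
    predicted = set(predicted_pairs)
    positives, negatives = set(positive_pairs), set(negative_pairs)
    tp = len(predicted & positives)
    fn = len(positives - predicted)
    fp = len(predicted & negatives)
    tn = len(negatives - predicted)

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1}


def relative_change(new, old):
    """Fractional change of new relative to old."""
    return (new - old) / old
```

With the sensitivities reported above, `relative_change(0.61, 0.32)` is approximately 0.906, that is, the 91% increase quoted in the text.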
Next, we directly compared the 30 EnTDefs with independent enhancer-gene pair datasets, including 5 computationally derived datasets (FOCS [50], GeneHancer [51], JEME [52], PEGASUS [53, 54], and RIPPLE [55]) and 2 experiment-based datasets (HACER [56] and the dataset from Jung et al. (referred to as RB) [57]) (Additional file 2: Table S4). The overlap coefficient (i.e., the number of enhancer-gene pairs shared between two datasets divided by the number of pairs in the smaller dataset [49]) was used to rank the EnTDefs (see “14”). The overlap coefficient-based ranks of the 30 EnTDefs were significantly correlated with their original F1 score-based ranks in four out of the five computationally derived datasets, and in both of the experimental datasets (Pearson’s correlation ranges from 0.6 to 0.9, p < 0.0001, Fig. 4C, D). Moreover, we evaluated the GSE performance of the same set of top, middle, and bottom ranked EnTDefs for two dataset pairs: genome-wide DNA methylation (WGBS) and RNA-seq data for the same tumor samples comparing two subtypes of HPV-associated head and neck cancer, and ATAC-seq and RNA-seq datasets studying overexpression of the transcription factor Sox17 in the same cells (see “14”). In both cases, the top ranked EnTDefs with the regulome data were better able to recapitulate the biological processes changed in the gene expression data generated in the same experiment than the middle or bottom ranked EnTDefs (see “14” and Additional file 1: Fig. S7). These findings validate that the GSE-derived F1 score-based ranking captures true biological signal and is a valid approach to prioritize the EnTDefs.
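The overlap coefficient defined above can be written as a short Python function. In this sketch, enhancer-gene pairs are compared by exact identifier as a simplifying assumption (matching enhancers by genomic overlap is not modeled), and the names are hypothetical.

```python
def overlap_coefficient(pairs_a, pairs_b):
    """Shared pairs divided by the size of the smaller dataset."""
    a, b = set(pairs_a), set(pairs_b)
    if not a or not b:
        raise ValueError("both datasets must contain at least one pair")
    return len(a & b) / min(len(a), len(b))
```

For example, if one dataset has three pairs, the other has two, and they share one pair, the coefficient is 1/2 = 0.5. A value of 1 means the smaller dataset is fully contained in the larger one.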
Comparison of the top EnTDef with other enhancer-gene pair datasets using GSE
Next, we compared the GSE performance between the best EnTDef and the aforementioned seven independent enhancer-gene pair datasets. Using the same ChIP-seq GSE evaluation method, we calculated the F1 scores of the 87 ChIP-seq datasets for each of the comparative enhancer-gene pair datasets and the combined datasets (best EnTDef + comparative dataset) (see “14”). The best EnTDef significantly outperformed the independent datasets and the combined ones (Wilcoxon signed-rank test, p < 0.05, Fig. 5A). Remarkably, integrating the EnTDef into the comparative datasets improved their performance, but in every case the combination failed to outperform our top EnTDef itself. Similarly, the best CT-EnTDef performed significantly better than the two cell-type-specific datasets (CT-RIPPLE and CT-HACER) in all three investigated cell types (GM12878, H1hESC and K562) (Wilcoxon signed-rank test, p < 0.05, Fig. 5B). Interestingly, the performance of the combined cell-type-specific datasets (best CT-EnTDef + CT-RIPPLE, or best CT-EnTDef + CT-HACER) was comparable to that of the best CT-EnTDef. This suggests that the best EnTDef leverages sufficiently comprehensive enhancer-gene interactions based on the state-of-the-art knowledge in this field, and that coupled GSE is able to capture the biological regulatory programs from regulome data.
Incorrect gene assignments by nearest distance method are not random
Since enhancers are known to be located up to 1 Mbp away from their regulatory genes [14, 58], several interceding genes can reside between a TF binding site (peak) in an enhancer and its target gene(s), as modeled by our EnTDefs (Additional file 1: Fig. S8). In contrast, the nearest distance method simply links a peak to the gene with the nearest TSS without accounting for interceding genes. By ranking the genes based on the average number of interceding genes across the enhancers that target them, we investigated whether the number of interceding genes is randomly distributed across genes and GO terms, or if there are GO terms significantly enriched with genes having more or fewer interceding genes [59]. We investigated the best-performing EnTDef excluding the “nearest_all” addition, in order to assess the “smart” enhancer-target links only. The genes least likely to have interceding genes were found to be significantly enriched in G protein-coupled receptor activity (FDR = 1.41 × 10−14), olfactory receptor activity (FDR = 6.21 × 10−12), detection of chemical stimulus (FDR = 3.23 × 10−11), phenol-containing compound metabolic process (FDR = 1.91 × 10−4), GABA-ergic synapse (FDR = 2.35 × 10−4), RISC complex (FDR = 2.39 × 10−4), postsynaptic membrane (FDR = 4.13 × 10−4), and behavior (FDR = 4.71 × 10−4) (Fig. 6A). These GO terms enriched with genes least likely to have interceding genes (lower-ranked genes) are most likely to be correctly assigned by the nearest distance method (Additional file 1: Fig. S1: >5 kb LocDef), and thus most easily detectable by current GSE testing. 
Conversely, the GO terms enriched with higher numbers of interceding genes (upper ranked genes) were mRNA metabolic process (FDR = 8.09 × 10−8), regulation of catabolic process (FDR = 8.40 × 10−8), chromatin organization (FDR = 2.53 × 10−7), kinase binding (FDR = 1.75 × 10−6), heterocycle catabolic process (FDR = 3.22 × 10−6), chromatin (FDR = 7.25 × 10−6), hemopoiesis (FDR = 9.47 × 10−6), and RNA processing (FDR = 2.27 × 10−5) (Fig. 6A). Those GO terms are least likely to be assigned by the nearest distance method, and most likely missed using current methods for GSE testing.
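The gene-level statistic used in this analysis, the average number of interceding genes across the enhancers that target a gene, can be illustrated with the Python sketch below. It assumes a single chromosome, one TSS per gene, and a single position per enhancer; a gene counts as interceding if its TSS lies strictly between the enhancer and the target TSS. These simplifications and all names are assumptions made for illustration.

```python
def count_interceding_genes(enhancer_pos, target_gene, tss_positions):
    """Count genes whose TSS lies strictly between an enhancer and its target.

    tss_positions: {gene: TSS position}, all on the same chromosome
    """
    low, high = sorted((enhancer_pos, tss_positions[target_gene]))
    return sum(1 for gene, tss in tss_positions.items()
               if gene != target_gene and low < tss < high)


def mean_interceding_per_gene(links, tss_positions):
    """Average interceding-gene count per target gene.

    links: iterable of (enhancer_position, target_gene)
    """
    counts = {}
    for enhancer_pos, gene in links:
        counts.setdefault(gene, []).append(
            count_interceding_genes(enhancer_pos, gene, tss_positions))
    return {gene: sum(values) / len(values) for gene, values in counts.items()}
```

Genes can then be ranked by this average before testing for GO term enrichment at either end of the ranking.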
To determine if this observation is robust to different EnTDefs, we performed the same analysis on all of the top 10 best-performing EnTDefs without the “nearest_all” addition, and combined the results by calculating FDR-adjusted harmonic mean p-values, followed by removing redundant terms (see “14”). Consistently, G protein-coupled receptor activity, olfactory receptor activity, RISC complex, and postsynaptic membrane were still among the top 5 enriched terms for the genes with fewer interceding genes, and similarly, regulation of catabolic process, chromatin organization, kinase binding, and heterocycle catabolic process were among the top 5 enriched terms in upper ranked genes with more interceding genes (Fig. 6B). These findings indicate that the genes with the most and the fewest interceding genes are not random: chemical stimulus and neuron-related genes can be easily assigned with the nearest distance method, whereas metabolic processing and chromatin organization genes may be frequently missed. This is concordant with the knowledge that enhancers regulate genes via long-range chromatin interactions, which our EnTDefs are able to capture.
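As a rough illustration of combining per-EnTDef enrichment results, the Python sketch below computes an unweighted harmonic mean p-value for each GO term and then applies a Benjamini-Hochberg adjustment across terms. Both the equal weighting and the choice of Benjamini-Hochberg are assumptions, since this section does not specify them, and the sketch omits the calibration step that some harmonic mean p-value procedures apply. All names are hypothetical.

```python
def harmonic_mean_p(pvalues):
    """Unweighted harmonic mean of p-values, each in (0, 1]."""
    if not pvalues or any(p <= 0 or p > 1 for p in pvalues):
        raise ValueError("p-values must be non-empty and lie in (0, 1]")
    return len(pvalues) / sum(1 / p for p in pvalues)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


def combine_terms(pvalues_by_term):
    """Combine per-definition p-values per GO term, then adjust across terms.

    pvalues_by_term: {go_term: list of p-values, one per EnTDef}
    """
    terms = list(pvalues_by_term)
    combined = [harmonic_mean_p(pvalues_by_term[t]) for t in terms]
    return dict(zip(terms, benjamini_hochberg(combined)))
```

For two p-values of 0.01 and 0.04, the harmonic mean is 2 / (100 + 25) = 0.016, which lies closer to the smaller p-value than the arithmetic mean would.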
Guidance for selecting a peak-to-gene assignment method in GSE analysis
The first step in GSE testing of cis-regulome data, such as TF binding sites or chromatin marks from ChIP-seq, is to assign the genomic regions or peaks to their target genes. Different assignment methods can lead to variable enrichment results and to false-positive and/or false-negative findings, as discussed above (nearest distance method vs. EnTDef). To avoid misinterpretation of genome-wide regulatory data, an appropriate LocDef method must be selected with care, and the choice should be specific to the particular research question and the genomic regions of interest. Figure 7 summarizes three general categories of research questions and the corresponding regions of interest: (i) the 5 kb or 1 kb LocDef should be selected when interested in how a TF and/or chromatin mark regulates gene expression from promoters; (ii) the EnTDef (enhancer) should be selected when interested in how a TF and/or chromatin mark regulates gene expression from distal regions; and (iii) when the comprehensive regulatory signature is of interest, including both promoter and distal regions, our EnTDef plus 5 kb LocDef (enhancer.5kb) should be selected. The promoter LocDef has the lowest genome coverage (10% for <5 kb LocDef and 2% for <1 kb LocDef), while the EnTDef plus 5 kb has 100% genome coverage, and the EnTDef has intermediate genome coverage (90%). We incorporated our top-performing EnTDef and EnTDef.plus5kb into the Bioconductor package chipenrich [42] and the ChIP-Enrich website (chip-enrich.med.umich.edu), allowing users to select the most suitable genomic region-gene assignment methods, gene sets, and GSE method to correctly interpret their genome-wide regulatory data.
In addition, we provide a peak-to-gene assignment functionality in our GSE Suite (gsesuite.dcmb.med.umich.edu), by which users can select any possible combination of enhancer location and enhancer-to-gene target methods (as described in this study) and obtain the gene assignments for a user uploaded list of genomic regions, based on the selected EnTDef, or other method (e.g., promoters, exons, introns or anywhere in the genome).