- Research
- Open access
- Published:
DNA-binding factor footprints and enhancer RNAs identify functional non-coding genetic variants
Genome Biology volume 25, Article number: 208 (2024)
Abstract
Background
Genome-wide association studies (GWAS) have revealed a multitude of candidate genetic variants affecting the risk of developing complex traits and diseases. However, the highlighted regions are typically in the non-coding genome, and uncovering the functional causative single nucleotide variants (SNVs) is challenging. Prioritization of variants is commonly based on genomic annotation with markers of active regulatory elements, but current approaches still poorly predict functional variants. To address this, we systematically analyze six markers of active regulatory elements for their ability to identify functional variants.
Results
We benchmark against molecular quantitative trait loci (molQTL) from assays of regulatory element activity that identify allelic effects on DNA-binding factor occupancy, reporter assay expression, and chromatin accessibility. We identify the combination of DNase footprints and divergent enhancer RNA (eRNA) as markers for functional variants. This signature provides high precision, but with a trade-off of low recall, thus substantially reducing candidate variant sets to prioritize variants for functional validation. We present this as a framework called FINDER—Functional SNV IdeNtification using DNase footprints and eRNA.
Conclusions
We demonstrate the utility to prioritize variants using leukocyte count trait and analyze variants in linkage disequilibrium with a lead variant to predict a functional variant in asthma. Our findings have implications for prioritizing variants from GWAS, in development of predictive scoring algorithms, and for functionally informed fine mapping approaches.
Background
Only a very small (2%) proportion of the human genome codes for protein-coding genes. Yet, genetic variants in the non-coding genome are known to contribute to Mendelian diseases and complex traits. Genome-wide association studies (GWAS) have uncovered thousands of genetic variants that are associated with traits or alter disease risk, and these loci predominantly lie in the non-coding genome [1] Identifying functional and causal non-coding variants from GWAS, and the target genes they regulate, remains a major challenge. The model for effects on trait proposes that non-coding variants alter the targets of sequence-specific nucleic acid binding proteins in cis-regulatory elements (CREs) resulting in quantitative changes in gene expression. This could be at the level of modulating RNA splicing by RNA binding proteins [2], or through altering the affinity for DNA binding proteins such as transcription factors (TF) at enhancers [3, 4]. Rare, pathogenic mutations in enhancers have been shown to reduce [5] or increase [6] TF binding. Contrasting the large functional affects for rare alleles, common single nucleotide variants (SNVs) tagged in GWAS are thought to subtly modulate TF occupancy at CREs [4, 7, 8].
Because GWAS identifies regions that contain multiple SNVs in linkage disequilibrium (LD), identifying the functionally relevant SNV(s) poses a significant challenge. One route is to analyze chromatin structure at active CREs. Binding of TFs at enhancers leads to nucleosome displacement, and recruitment of co-activators that can then further modify chromatin including, but not limited to, acetylation of histone H3 lysine 27 (H3K27ac) [9]. The Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) is widely used for profiling sites with disrupted nucleosome structure and has been used in efforts to prioritize risk variants [10, 11]. Prior to the widespread use of ATAC-seq, DNase hypersensitive site-seq (DHS-seq) and DNase I footprinting have been used to identify sites of likely TF-binding and nucleosome disruption and to assess the effect of genetic variants on TF binding [4, 12, 13].
Enhancers and their target genes are most often located in the same topologically associating domain (TAD). Therefore, the integration of GWAS with data on tissue-specific features of active chromatin (ATAC-seq, DHS-seq, DNase footprinting, H3K27ac ChIP-seq) and chromosome conformation capture has been used to try and predict functional variants and their target genes [14, 15, 16]. However, predictions based on active regulatory marks often fall short in pinpointing functional variants [17, 18].
Computational models have also been developed to predict functional non-coding variants using machine learning methods including deep neural networks or linear regression models based on feature annotations, such as open chromatin, histone marks, TF binding, and conservation scores [19]. However, while showing some efficacy for rare variants, they poorly predict variant effects for common non-coding variants [19, 20]. Prioritizing and identifying functional non-coding variants therefore remain a barrier to GWAS follow-up studies.
Another feature of active enhancers is the production of short, unstable bidirectional enhancer RNAs (eRNAs) [21]. To the best of our knowledge, this feature has not been used in the prediction of functional genetic variants. Here, we analyze the utility of active regulatory element marks—open chromatin, H3K27ac, DNase footprints, TF binding, as well as the presence of divergent eRNA, for the prioritization and discovery of functional variants. We analyzed active regulatory markers using > 50,000 datasets from > 2000 cell types, tissue types, and treatment conditions and benchmarked these features with variants that demonstrate enhancer modulating activity using cis-acting molecular quantitative trait loci (molQTL). We find that the combination of DNase footprints and eRNA achieves high precision in predicting functional non-coding variants, and we apply this approach to data from GWAS traits. In this way, we demonstrate a framework to allow prioritization of candidate variants from GWAS data to aid discovery of functional and causative genetic variants.
Results
A compendium of genomic markers of active regulatory regions
Functional non-coding genetic variants are thought to reside in CREs. Based on this, we collated a compendium of published datasets of markers for active CREs from human cells and tissues. We included markers for open chromatin (DHS from DNase-seq and Assay for Transposase-Accessible Chromatin (ATAC-seq)), H3K27ac, divergent enhancer RNA (eRNA), DNase footprints, and DNA-binding factor occupancy from chromatin immunoprecipitation (ChIP). eRNA data come from multiple assays including Cap Analysis Gene Expression (CAGE), Global run-on (GRO)-seq, Precision nuclear run-on (PRO)-seq, and capped small RNA-seq [22]. Data were collected from uniformly processed resources, which included multiple replicates, across cell lines, primary cells, or tissue biosources. In aggregate, this comprised > 50,000 datasets from > 2000 biosources (Fig. 1A, Additional file 1: Table S1). Most datasets are from ChIP-seq, including ChIP for 956 DNA-binding factors and for some factors across multiple cell types or treatment conditions. Only one cell-type, hepatocytes, shares data across all assay types (Additional file 1: Fig. S1).
Taking the aggregate across all biosources for each marker, the number of elements ranges from 1.3 million (eRNA) to 93.3 million (ATAC-seq), with the median length of the elements dependent on the assay (Fig. 1B). The majority of active CREs are found in the non-coding regions—intergenic regions, promoters, or introns (Fig. 1C). We computed the pairwise Jaccard index for markers, which shows the highest similarity is between ATAC and H3K27ac (Fig. 1D).
For each marker, we merged overlapping datasets across all biosources to determine the non-redundant genome coverage of active CREs. The largest coverage is represented by ATAC-seq (75.2%), followed by H3K27ac (61.5%). This is likely due to the high number of biosources (Fig. 1A), the longer median length and distribution (Fig. 1B), but also probably indicates too low a threshold used in peak calling. The smallest coverage is from eRNA (2.1%) followed by DNase footprints (3.6%) (Additional file 1: Fig. S1B).
Prioritization of genetic variants by combinations of regulatory markers
Functional non-coding genetic variants are likely to impact target gene expression through altering CRE activity. To assess the utility of markers of active CREs in prioritizing functional non-coding variants, we established a set of variants which colocalize with molecular quantitative trait loci (molQTL) to benchmark against. While genetic variants that impart phenotypic variation have numerous terms dependent on the molecular trait, we use the unifying term QTL here to include molQTL detected by regression, but also variants with allelic imbalance determined by other statistical methods. We selected three molQTL assays which measure different molecular mechanisms and which demonstrate variant effects in cis that impact CRE activity: DNA-binding factor QTL (bQTL) [23], massively parallel reporter assay (MPRA) QTL (raQTL) [24], and chromatin accessibility QTL (caQTL)—a merge of two sources [25, 26] and [13, 27]—also called chromatin-altering variants (CAV) (Fig. 2A). Variants identified by these molQTL assays are derived from > 16 K datasets across > 800 biosources (Fig. 2B, Additional file 1: Table S3). Most of these datasets are from bQTL assays based on ChIP-seq annotations, while caQTL are based on DNase-seq and ATAC-seq datasets. raQTL variants were obtained from one resource which identified raQTLs in K562 and HepG2 cell lines, using an unbiased set of 5.9 million variants [24]. We have not included other studies using MPRA to identify raQTL as these often use variants identified from specific GWAS traits which would bias our downstream analysis. There was minimal overlap between variants colocalizing with bQTL, raQTL, and caQTL (Fig. 2C). This maintains each molQTL assay as a largely independent set of variants.
Using each molQTL to benchmark, we aimed to identify functional variants from the GWAS catalog [28, 29], which is derived from > 28 K studies and > 5 K traits or diseases, and containing variant-trait associations with p value of ≤ 1.0 × 10−5 (Additional file 1: Table S3). This set contains > 180 K non-redundant variants, the majority of which are in non-coding regions (Additional file 1: Fig. S2A). To identify how active regulatory marks can functionally annotate non-coding variants, we merged all biosources for each mark, and in the case of ChIP this included merging all DNA-binding factors (Additional file 1: Fig. S2B). This overcomes two main issues. Firstly, that the cell- or tissue-type in which a variant is functional is not known a priori. Secondly, the low overlap of biosources for each marker (Fig. 1B) would limit intersection analysis. We therefore took a cell/tissue agnostic approach by merging biosources. With this, we determined the co-localization of the GWAS catalog variants with each marker of active CREs, and all combinations of markers, totaling 55 combinations from two-way up to six-way combinations.
We determined the precision (positive predictive value (PPV) − [true positive/(true + false positives)]) for each marker, and their combinations, by the intersection with variants from the GWAS catalog, benchmarked against the intersection of the marker combinations with each molQTL as a true positive set. The precision value was dependent on the size of the molQTL set relative to the larger GWAS catalog. The range of precision values was 7.5–55.46% for bQTL, raQTL 0.42–3.53%, and caQTL 4.02–27.7%. In all cases, the highest precision was achieved by the combination of all six markers. To compare precision values between the molQTL, we converted precision values to Z-scores for each molQTL, giving a range of scores from − 2.2 to 1.44. Using unsupervised clustering, we generated a Z-score heatmap of precision values for feature combinations against molQTL assays, showing four clusters (Fig. 3A). The cluster with the highest Z-scores (cluster 3) showed that two markers, DNase footprints and eRNA, were present for all feature combinations within the cluster, including the combination of these two markers alone. In contrast, the cluster with the lowest Z-score (cluster 1) excluded any combination of markers that had footprints or eRNA. As independent markers, DNase footprint or eRNA were present in an intermediate cluster (cluster 2). This suggests that the combination of DNase footprints and eRNA, with or without other markers, provides high precision for identifying functional variants. Increasing the number of features increased the number of predicted negative variants, although the number of predicted negative variants is > 90% for footprints or eRNA alone (Additional file 1: Fig. S3A).
While all six features are markers of active CREs, we considered if GWAS variants that intersect these features are associated with functional enhancers. Using a validated set of functional enhancers from CRISPR deletion and CRISPR interference experiments, and non-enhancers from MPRA experiments [22], we overlapped GWAS variants with feature(s). We find that GWAS variants in footprints and eRNA independently are associated with functional enhancers compared to other features alone (Additional file 1: Fig. S3B). However, GWAS variants that coincide with footprints and eRNA together have the highest overlap with functional enhancers.
We next considered the precision of marker combinations to identify functional variants across the entire genome, without association with GWAS traits. Taking all common genetic variants (minor allele frequency (MAF) ≥ 1%), consisting of > 660 M non-redundant SNVs, we randomly selected subsets to obtain a comparable number to the GWAS catalog of ~ 180 K variants. We calculated the precision of combinations of markers and then converted this to a fold change over random SNVs. We repeated the random SNV subsetting 100 times to generate a mean fold change and plotted the markers and combinations from lowest to highest fold change (Fig. 3B). This showed that ATAC-seq, H3K27ac, and the combination of the two are poor discriminators of functional variants (mean fold change 1.12–1.44). In comparison, a step change is observed for the combination of DNase footprints and eRNA, with a mean fold change of 13.89 over random SNVs. Footprints and eRNA alone show a mean fold change of 7.37 and 8.09, respectively, suggesting that they can individually provide functional annotation of variants across the genome, but with higher precision when combined. This is in keeping with the precision observed from footprints with eRNA to identify functional variants in the GWAS catalog. Our analyses suggest that the combination of DNase footprints and eRNA are strong predictors of functional variants with high precision, with additional markers providing only small increments in precision. Importantly, when considering the precision of GWAS compared to random genomic variants, the GWAS catalog shows a mean fold change of 8.24 which, perhaps expectedly, demonstrates that GWAS can identify functional variants, above a random model.
To corroborate our findings, we employed an independent machine learning approach. We trained a random forest classifier model to predict any molQTL signature (bQTL/raQTL/caQTL) based on overlaps with the six active regulatory markers, providing a predictive value of 0.64. We determined SHAP values as a measure of feature importance for the model, which revealed that DNase footprints and eRNA are the top positively associated features for the predictive model (Fig. 3C).
Effect of thresholding of feature calls on precision scores
The local genomic enrichment of assay signals is often translated to interval regions (peaks) through different methods and at varying significance thresholds. These statistical variations in calling enriched regions, for example, false discovery rates (FDR), may impact precision scores for functional variants. For DNase footprints, we used consensus footprints determined by a posterior probability > 0.99 across biosources [13, 27]. We compared this with FDR for footprints from individual biosources and determined the precision scores associated with increasing FDR stringency. For footprints, we find that increasing stringency does increase precision, although the consensus footprint calling performs well (Additional file 1: Fig. S3C). For the remainder of our analyses, we restrict to the consensus set of footprints as these are derived across biosources and are less biased towards individual datasets or cell/tissue types. As ATAC-seq performs relatively poorly with regard to predicting functional variants, peak calling thresholds may impact precision scores. We used an independently analyzed source (ATACdb) [30, 31], which has a smaller number of datasets (~ 1400) compared to the ~ 3700 from GTRD [32, 33]. We further threshold these datasets by p value to increase stringency over local background. We find ATACdb to have a high precision score, but increasing p value thresholds does not improve precision scores. We have taken a third source of ATAC-seq data from ENCODE, using narrow peaks to define local regions of high amplitude and consider this dataset with and without irreproducible discovery rate (IDR) thresholding. We find the ENCODE narrow peaks precision scores improve slightly with IDR thresholding (Additional file 1: Fig. S3). Data processing and significance stringency can therefore impact the precision scores for identifying functional variants. However, this analysis shows that the relatively low predictive value of ATAC-seq peaks is not simply due to insufficiently stringent thresholding.
Centrality in open chromatin to predict functional non-coding variants
TF binding can initiate and maintain chromatin accessibility [34] and indeed summits of DHS have been shown to be enriched for DNA-binding factor motifs [13]. Using a consensus reference map of DHS with summit annotations [35, 36], we tested if centrality in DHS could predict functional variants. We compared core DHS regions, which comprises a consensus centroid region aggregated across biosamples, and distances from the summit at 25 bp, 50 bp, and 100 bp up- and downstream, with a median core length of 37 bp (Additional file 1: Fig. S4A, B). By using variants from the intersection with the GWAS catalog benchmarked against molQTL variants, we calculated precision scores as fold change above GWAS catalog alone. DHS cores have ~ 1.5-fold precision, and this is not increased by extending to 100 bp on either side of the summit (Fig. 4A). Although many footprints fall outside the DHS core, DNase footprints are found at a higher frequency nearer DHS summits (Fig. 4B). Together, these data suggest a mechanism for functional variants related to the enrichment of DNA-binding motifs near DHS summits. When comparing centrality as a means to identify functional variants with DNase footprints and eRNA, the precision is even higher with the combined DNase footprints/eRNA markers (Fig. 4A).
Replication with functional, rare, non-coding Mendelian disease-associated variants
To validate our observation of high precision using the combination of DNase footprints and eRNA, we sought to replicate our analysis using a set of validated functional non-coding variants. We used a published set of manually curated rare non-coding variants associated with Mendelian diseases consisting of 210 non-redundant variants in promoter, intergenic, intronic, or 3′UTR regions [37] (Fig. 4C). Given the small number of variants within this validated set, we limited our precision calculations to single markers, or the combination of DNase footprints and eRNA only. We determined precision by spike in of ~ 4 K randomly selected variants across the whole genome, giving a true positive rate of approximately random chance (~ 0.05). We generated precision and recall scores based on the mean from 10 permutations of random genomic variant spike-ins (Fig. 4D). The combination of DNase footprints and eRNA shows the highest precision at 0.65, but low recall (0.25). In contrast, ATAC-seq and H3K27ac show low precision but high recall. DNase footprints and eRNA each alone have intermediate precision and recall. These observations are consistent with our analysis of the GWAS catalog and genomic variants, suggesting the utility of DHS footprints and eRNA in identifying both rare and common functional non-coding variants.
Prioritizing variants from genome-wide association studies
To demonstrate the utility of DNase footprints and eRNA in prioritizing non-coding variants, we analyze 53 traits for intersections with footprints, eRNA, or both (Table 1). Across these traits, the majority of associated variants are, as expected, in the non-coding genome, and this is maintained when using the combined footprint and eRNA marks to prioritize variants (Fig. 5A). Using the leukocyte count trait from the GWAS catalog to exemplify our prioritization approach, ~ 10 K non-redundant variants with significance of < 9 × 10−6 from 51 studies are reduced ~ 30-fold by requiring co-localization with DNase footprints and eRNA (Table 1). Using gene ontology (GO) analysis of neighboring genes, we applied GREAT analysis [38] to all variants from the trait compared to just the variants within footprints and eRNA. GO terms are preserved between all variants and those in footprints and eRNA (Fig. 5B), with enriched GO terms in keeping with leukocyte function. For the most significant GO term, regulation of immune response, we observe preservation of 55% of genes in the footprint and eRNA variant set, compared with all variants (Fig. 5C) suggesting that, despite a ~ 30-fold reduction in number of candidate variants, the enriched genes and loci are largely preserved.
Non-coding variants function in a cell-type specific manner (e.g., [46, 47]) but uncovering the affected cell-type from GWAS can be challenging. By leveraging DNase footprints and eRNA to identify functional variants with high precision, we can then infer the relevant cell-type from this subset of variants. As both footprints and eRNA are associated with open chromatin [25], we took the leukocyte count associated variants co-localizing with footprints and eRNA and identified cell-type specific enrichment of DHS using FORGE2 [48]. Since footprints are derived from DNase-seq, the subset of variants that colocalize with DNase footprints and eRNA show a greater enrichment in DHS compared to an equal number of randomly selected GWAS variants for leukocyte count. Importantly however, despite our cell- and tissue-type agnostic approach of merging features, colocalized variants show enrichment in hematopoietic or immune cell types (Fig. 5D), demonstrating preserved ability to identify trait relevant cell-types despite ~ 30-fold reduction in variant number. Highlighting one variant identified from DNase footprints and eRNA, rs12722502 is an IL2RA intronic variant and is in a myeloid and lymphoid specific DHS (Fig. 5E). This variant is predicted to cause a loss of binding of ETS family TFs (Fig. 5F), members of which are known to regulate immune cell function [49]. Indeed, rs12722502 is associated with fivefold reduction in binding of ELF1 (a TF in the ETS family), in a human leukemic T-cell line [50].
Identifying functional variants in linkage disequilibrium with lead variants
Uncovering functional or causative variants from GWAS is also challenged by LD, whereby the causative variant may be a variant in LD with the GWAS lead variant. We analyzed the utility of the six active regulatory features, and the combination of footprints and eRNA, to identify functional variants in LD with GWAS variants. Using GWAS variants from 53 traits, and variants in LD (R2 > 0.7) with each GWAS variant, we curated ~ 1.5 M variants across traits, which together consisted of ~ 800 K non-redundant variants. Taking GWAS and LD variants for each trait, we calculated precision and recall scores, by benchmarking against the combined set of variants in bQTL, raQTL, and caQTL. We determined mean precision and recall scores for the regulatory feature(s) (Fig. 6A, B) and observe the DNase footprint and eRNA combination to reproducibly have the highest precision, but low recall. While including variants in LD is important, this increases the variant set by a mean of ~ 27-fold across traits. However, considering only LD variants intersecting footprints and eRNA reduces the variant set by a mean of ~ 80-fold (Additional file 1: Table S2).
To demonstrate the validity of colocalization with DNase footprints and eRNA to prioritize variants in LD with lead variants, we selected the asthma trait, and considered strong lead GWAS variants which were not associated with the HLA locus, to avoid cell-type bias. We identified rs72823641, which has been found to be associated with asthma in multiple GWAS [51], with p values ranging from 4 × 10−12 to 2 × 10−136, and mapping to a large intron of IL1RL1. We found 92 variants in LD with rs72823641 (R2 > 0.7) using European ancestry data (Fig. 6C). Intersecting these with DNase footprints and eRNA reduced the variants to one candidate—rs10173081 (R2 = 0.96), which resides in myeloid specific open chromatin (Fig. 6D). While rs10173081 is not tagged in the GWAS catalog, it has been observed in other GWAS for asthma [52]. rs10173081 is also an eQTL for IL1RL1 and IL18R1 in lung, but also for IL18RAP and a non-coding RNA AC007278.3 in whole blood (Fig. 6E). Applying TF motif based binding prediction, we identified putative loss of binding for TFs including TCF21, TCF12, TCF4, and KLF3 (Fig. 6F). Interestingly, TCF21 and KLF3 loci have been previously identified as asthma risk-associated GWAS loci [53]. In conclusion, by using combined DNase footprint and eRNA markers, plausible functional variants can be prioritized from variants sets in LD with lead variants.
Discussion
GWAS have been instrumental in uncovering complex trait- and disease-associated variants and regulatory pathways. However, the identification of functional and causative variants remains a significant challenge, impacting the mechanistic understanding of complex disease genetics and the identification of disease/trait relevant pharmacological targets. Consequently, few GWAS have been translated into therapeutic interventions [54]. Discovering functional variants in the non-coding genome constrained by methodological limitations. Computational approaches have been developed that utilize variable genomic features to aid variant prioritization, including functional annotations, conservation scores, and TF motifs [19, 55]. However, for non-coding variants, these methods tend to have low precision for identifying functional variants [19].
Here, we systematically analyzed six assays for active CREs, benchmarking against independent molQTLs for enhancer activity. We find that ATAC peaks and H3K27ac are poor predictors. This is in keeping with previous studies that have shown that deletion of elements harboring candidate variants marked by ATAC peaks or H3K27ac is unable to detect functional impact [17, 18, 56]. We hypothesize that the poor precision, but high recall, of ATAC-seq and H3K27ac peaks relates to their high feature size and high genomic coverage. In contrast, we identify a signature that performs well with respect to precision in identifying genetic variants with putative function. DNase footprints and eRNA have independently been suggested to be enriched for putative functional variants [13, 22]. Individually DNase footprints and eRNA both show good precision; however, we demonstrate that their combination provides superior discrimination. When using genomic features for prioritizing functional variants, consideration should be made for statistical stringency as this may alter precision for functional discovery. Additionally, centrality may contribute, but we find including central regions imparts little improvement in precision compared to the combined DNase footprint and eRNA marks.
Our model for the precision offered by combined DNase footprints and eRNA markers is based on direct DNA binding of a factor, such as a TF, leading to recruitment of RNA polymerase II and other machinery to promote regulatory activity. In agreement with this model, eRNAs have been shown be an independent predictor of functional enhancers in mice in vivo [57], and to predict TF binding, where TF binding and eRNA origins colocalize [4]. eRNAs have also been found to mark functional enhancer-promoter interactions [58]. Together, these suggest eRNAs are important markers of functional CREs.
Some markers of active elements perform with low precision, such as open chromatin and DNA-factor binding (ChIP). Consistent with this, < 30% of ATAC peaks in primary CD4 + T-cells demonstrated activity using an MPRA, with MPRA signals instead correlating well with eRNA production [59]. If evidence of binding is required for functional variant prediction, then ChIP-seq signals would be expected to perform well. However, MPRA of glucocorticoid receptor binding sites showed < 20% had significant activity [60], suggesting that many bound TFs are not productive. However, artefactual binding observed from cross-linking in ChIP-seq experiments [61], and co-localization of multiple TF binding events enriching for non-related motifs, can lead to false positive binding sites [62]. Therefore, the footprint signal of direct TF binding, with a detection of eRNAs, provides better evidence for an active enhancer. Variants within these regions, particularly in DNA sequences interfacing with TF DNA binding domains (DBD), are thus more likely to be functional.
Despite the high precision for functional variants using the combination of DNase footprint and eRNA, restricting to these two markers comes with limitations, principally related to low recall, and a high number of false negatives. The low recall can be explained by limitations inherent to the DNase footprint and eRNA datasets. Firstly, the identification of DNase footprints may be related to residence time of TFs on DNA, where rapidly exchanging factors impart poor footprints [63]. Variants associated with altering binding of dynamic factors may therefore be missed, although this could be improved using modified methods such as DNase-seq with crosslinking [64]. Detection of footprints could also be improved by bias correction to improve false-negative rates [65]. Secondly, the eRNA database used in our analysis represents the lowest number of biosources, and is therefore incomplete, with highly variable numbers of features between cell-types, and predominantly under basal conditions, which would exclude context-dependent enhancer activity [22, 60]. To address this, eRNA profiling is required in cell-types, and under conditions, relevant to the GWAS trait of interest. The specific assay for divergent eRNAs may also impact precision scores for functional variants, as GRO-cap has been reported to have the highest sensitivity for functional enhancer detection [22].
Conclusion
We present our findings as a framework, FINDER—Functional SNV IdeNtification using DNase footprints and Enhancer RNA (Fig. 7). In this framework, the combination of DNase footprints and eRNA can prioritize and reduce candidate variants to a set with high precision. The merged datasets used can be analyzed for relevant cell-types associated with the complex trait or disease. This has important implications for current computational approaches and databases for predictive functional variant scores. Our findings can inform predictive scoring approaches as these do not utilize eRNA as a predictive marker. Finally, the high precision of DNase footprints and eRNA compared to other marks may be used for functionally informed fine mapping approaches [66], which could improve detection of causative non-coding variants.
Methods
Data sources
Datasets for active chromatin marks were obtained from published and publicly available sources. DHS [35, 36] and DNase footprints [13, 27] were obtained from ENCODE. H3K27ac was obtained from GTRD [32, 33] and dbInDel [43, 67]. Human ChIP-seq and ATAC-seq datasets were obtained from GTRD [32, 33]. Enhancer RNAs were obtained from the PINTS portal [22, 44]. These data sources provide uniformly analyzed datasets for downstream analysis. Details of downloaded versions and sources are shown in Additional file 1: Table S3. For each type of active chromatin mark, to generate assay master files agnostic to multiple cell-types or factor in the case of ChIP, Bedtools merge version from the bedtools suite v2.30.0 [42] was used, requiring overlap of 1 bp or greater. The Jaccard index was computed using bedtools jaccard from the bedtools suite v2.30.0 [42].
QTL datasets were obtained from published datasets. Binding QTL (bQTL) were downloaded from AD ASTRA which determines allele-specific binding events from human ChIP-seq datasets and identifies genetic variants from ChIP-seq data [23, 68]. Reporter assay QTL (raQTL) data were downloaded from publicly available massively parallel reporter assays (MPRA) [24]. Chromatin accessibility QTL (caQTL) were downloaded from ENCODE [13, 27] and QTLbase [25, 26]. Details of downloaded versions and sources are shown in Additional file 1: Table S3. For each molecular QTL, where an rsID occurs in multiple datasets, either multiple assays (such as ChIP) or multiple cell types, these were collapsed to generate a list of unique rsIDs.
GWAS summary statistics were downloaded from the NHGRI-EBI GWAS catalog V1.0.2 [28, 29] using variants with a p value < 10−5. Where analysis presented is agnostic to traits or studies, a master list of rsID found in the GWAS catalog was generated by collapsing to a list of unique rsIDs. Details of downloaded version and source are shown in Additional file 1: Table S3.
The dataset for genetic variants was obtained from UCSC [41] using the dbSNP build 151 containing uniquely mapping common SNPs with a ≥ 1% minor allele frequency (MAF). SNPs were randomly subsetted using gshuf in command line.
Confusion matrix calculations and normalizations
Precision scores were calculated as the fraction of the number of variants (GWAS catalog or other variant sets as indicated) overlapping the benchmarked molQTL (true positives) over the total number of variants (true and false positives) intersecting the feature (or feature combinations). Precision scores were expressed as a Z-score or fold change. Clustering of precision z-scores was performed in R using heatmap cluster function with the Ward D2 method. Fold changes were expressed as the precision score of the candidate set intersected with features divided by precision score of the candidate set without intersection with features.
Genome build conversions
Data were analyzed using human genome build Hg38. Where primary data sources were aligned to other genome versions (Hg 19), data was converted to Hg38 using Liftover [69]. The genome version of the primary data sources is indicated in Additional file 1: Table S3.
Intersections of features
To analyze the overlap between genetic variants, and/or genomic features, Bedtools intersect from the bedtools suite v2.30.0 [42] was used. For intersections with rsID, these were considered as the first feature (-a) in bedtools intersect -a < bed/vcf > -b < bed > .
Random forest classification
The PeakPredict package [45] was run using the command overlap_peaks with settings “–predict_column molQTL –model RandomForestClassifier –balance –shap,” where balancing downsamples the groups to have the same size prior to splitting into test and training sets. The PeakPredict package implements scikit-learn [70] and SHAP values [71].
Gene ontology analysis
GO analysis utilized Genomic Regions Enrichment of Annotations Tool (GREAT) version 4.0 [38] with rsID converted to BED format, using the default settings.
Predicting functional base on motif
Predictions of TF binding based on motifs utilized the FABIAN-variant [72] web interface, using Hg38m, using “All” TFS, the “TFFM detailed” model, and the JASPAR2022 database.
Linkage disequilibrium analysis
To generate variants in LD with a lead variant, we included variants within 500 Kb of a lead variant with an R2 ≥ 0.7 based on European ancestry and a minor allele frequency of at least 1%. To compile variants in LD, we used TopLD API with options -thres 0.7 -pop EUR -maf 0.01 [73].
Cell-type specific analysis
To identify cell-type enrichment of variants, we used FORGE2 [48], with input as rsID of variants, using DHS as data for enrichment, without LD filtering of rsIDs.
Genome location mapping
To annotate the genomic features to genome location, we used the annotatePeaks.pI program from HOMER tools v4.1 [74]. Gene annotation was customized using a GTF file containing hg38 gene transcript sets, downloaded from UCSC (hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.refGene.gtf.gz)
Availability of data and materials
Individual data sets are described in “Data sources” of the “Methods” section. The merged datasets for each active regulatory marker, merged as described in “Data sources,” and QTL merged datasets can be found at GitHub [77], available under an open source (MIT) license, and at Zenodo [78]. Further details are found in Additional file 1: Table S3.
References
Maurano MT, Humbert R, Rynes E, Thurman RE, Haugen E, Wang H, Reynolds AP, Sandstrom R, Qu H, Brody J, et al. Systematic localization of common disease-associated variation in regulatory DNA. Science. 2012;337:1190–5. https://doi.org/10.1126/science.1222794.
Qi T, Wu Y, Fang H, Zhang F, Liu S, Zeng J, Yang J. Genetic control of RNA splicing and its distinct role in complex trait variation. Nat Genet. 2022;54:1355–63. https://doi.org/10.1038/s41588-022-01154-4.
Johnston AD, Simões-Pires CA, Thompson TV, Suzuki M, Greally JM. Functional genetic variants can mediate their regulatory effects through alteration of transcription factor binding. Nat Commun. 2019;10:3472. https://doi.org/10.1038/s41467-019-11412-5.
Maurano MT, Haugen E, Sandstrom R, Vierstra J, Shafer A, Kaul R, Stamatoyannopoulos JA. Large-scale identification of sequence variants influencing human transcription factor occupancy in vivo. Nat Genet. 2015;47:1393–401. https://doi.org/10.1038/ng.3432.
Jeong Y, Leskow FC, El-Jaick K, Roessler E, Muenke M, Yocum A, Dubourg C, Li X, Geng X, Oliver G, et al. Regulation of a remote Shh forebrain enhancer by the Six3 homeoprotein. Nat Genet. 2008;40:1348–53. https://doi.org/10.1038/ng.230.
Lettice LA, Williamson I, Wiltshire JH, Peluso S, Devenney PS, Hill AE, Essafi A, Hagman J, Mort R, Grimes G, DeAngelis CL, Hill RE. Opposing functions of the ETS factor family define Shh spatial expression in limb buds and underlie polydactyly. Dev Cell. 2012;22:459–67. https://doi.org/10.1016/j.devcel.2011.12.010.
Carrasco Pro S, Bulekova K, Gregor B, Labadorf A, Fuxman Bass JI. Prediction of genome-wide effects of single nucleotide variants on transcription factor binding. Sci Rep. 2020;10:17632. https://doi.org/10.1038/s41598-020-74793-4.
Kasowski M, Grubert F, Heffelfinger C, Hariharan M, Asabere A, Waszak SM, Habegger L, Rozowsky J, Shi M, Urban AE, et al. Variation in transcription factor binding among humans. Science. 2010;328:232–5.
Vernimmen D, Bickmore WA. The hierarchy of transcriptional activation: from enhancer to promoter. Trends Genet. 2015;31:696–708. https://doi.org/10.1016/j.tig.2015.10.004.
Grishin D, Gusev A. Allelic imbalance of chromatin accessibility in cancer identifies candidate causal risk variants and their mechanisms. Nat Genet. 2022;54:837–49. https://doi.org/10.1038/s41588-022-01075-2.
Su C, Gao L, May CL, Pippin JA, Boehm K, Lee M, Liu C, Pahl MC, Golson ML, Naji A;, et al. 3D chromatin maps of the human pancreas reveal lineage-specific regulatory architecture of T2D risk. Cell Metab. 2022;34:1394-1409.e4. https://doi.org/10.1016/j.cmet.2022.08.014.
Moyerbrailean GA, Kalita CA, Harvey CT, Wen X, Luca F, Pique-Regi R. Which genetics variants in DNase-seq footprints are more likely to alter binding? PLoS Genet. 2016;12: e1005875. https://doi.org/10.1371/journal.pgen.1005875.
Vierstra J, Lazar J, Sandstrom R, Halow J, Lee K, Bates D, Diegel M, Dunn D, Neri F, Haugen E, et al. Global reference mapping of human transcription factor footprints. Nature. 2020;583:729–36. https://doi.org/10.1038/s41586-020-2528-x.
Gazal S, Weissbrod O, Hormozdiari F, Dey KK, Nasser J, Jagadeesh KA, Weiner DJ, Shi H, Fulco CP, O’Connor LJ, et al. Combining SNP-to-gene linking strategies to identify disease genes and assess disease omnigenicity. Nat Genet. 2022;54:827–36. https://doi.org/10.1038/s41588-022-01087-y.
Nasser J, Bergman DT, Fulco CP, Guckelberger P, Doughty BR, Patwardhan TA, Jones TR, Nguyen TH, Ulirsch JC, Lekschas F, et al. Genome-wide enhancer maps link risk variants to disease genes. Nature. 2021;593:238–43. https://doi.org/10.1038/s41586-021-03446-x.
Schaub MA, Boyle AP, Kundaje A, Batzoglou S, Snyder M. Linking disease associations with regulatory information in the human genome. Genome Res. 2012;22:1748–59. https://doi.org/10.1101/gr.136127.111.
Chen HV, Lorenzini MH, Lavalle SN, Sajeev K, Fonesca A, Fiaux PC, Sen A, Luthra I, Ho AJ, Chen AR, et al. Deletion mapping of regulatory elements for GATA3 in T cells reveals a distal enhancer involved in allergic diseases. J Hum Genet. 2023;110(4):703–14.
Gasperini M, Findlay GM, McKenna A, Milbank JH, Lee C, Zhang MD, Cusanovich DA, Shendure J. CRISPR/Cas9-mediated scanning for regulatory elements required for HPRT1 expression via thousands of large, programmed genomic deletions. Am J Hum Genet. 2017;101(2):192–205.
Wang Z, Zhao G, Li B, Fang Z, Chen Q, Wang X, Luo T, Wang Y, Zhou Q, Li K, et al. (2022) Performance comparison of computational methods for the prediction of the function and pathogenicity of non-coding variants. Genom Proteomics Bioinform. 7:S1672–0229(22)00016-X. https://doi.org/10.1016/j.gpb.2022.02.002.
Tabarini N, Biagi E, Uva P, Iovino E, Pippucci T, Seri M, Cavalli A, Ceccherini I, Rusmini M, Viti F. Exploration of tools for the interpretation of human non-coding variants. Int J Mol Sci. 2022;23:12977. https://doi.org/10.3390/ijms232112977.
Hou TY, Kraus WL. Spirits in the material world: enhancer RNAs in transcriptional regulation. Trends Biochem Sci. 2021;46:138–53. https://doi.org/10.1016/j.tibs.2020.08.007.
Yao L, Liang J, Ozer A, Leung AKY, Lis JT, Yu H. A comparison of experimental assays and analytical methods for genome-wide identification of active enhancers. Nat biotechnol. 2022;40(7):1056–65.
Abramov S, Boystov A, Bykova D, Penzar DD, Yevshin I, Kolmykov SK, Fridman MV, Favorov AV, Vorontsov IE, Baulin E, Kolpakov FF, Makeev V, Kulakovskiy IV. Landscape of allele-specific transcription factor binding in the human genome. Nat Commun. 2021;12(1):2751. https://doi.org/10.1038/s41467-021-23007-0.
Van Arensbergen J, Pagie L, FitzPatrick VD, de Haas M, Baltissen MP, Comoglio F, van der Weide R, Teunissen H, Vosa U, Franke L, de Wit E, Vermeulen M, Bussemaer HHJ, van Steensel B. High-throughput identification of human SNPs affecting regulatory element activity. Nat Genet. 2019;51(7):1160–9.
Zheng Z, Huang D, Wang J, Zhao K, Zhou Y, Guo Z, Zhai S, Xu H, Cui H, Yao H, Wang Z, Yi X, Zhang S, Sham PC, Li MJ. QTLbase: an integrative resource for quantitative trait loci across multiple human molecular phenotypes. NAR. 2020;48(D1):D983-991.
Zheng Z, Huang D, Wang J, Zhao K, Zhou Y, Guo Z, Zhai S, Xu H, Cui H, Yao H, Wang Z, Yi X, Zhang S, Sham PC, Li MJ. QTLbase [Internet]; 2016–2022. Available from: http://www.mulinlab.org/qtlbase.
Vierstra J, Lazar J, Sandstrom R, Halow J, Lee K, Bates D, Diegel M, Dunn D, Neri F, Haugen E, et al., Digital genomic footprinting [Internet]; 2020. Available from: https://www.vierstra.org/resources/dgf.
Sollis E, Mosaku A, Abid A, Buniello A, Cerezo M, Gil L, Groza T, Gunes O, Hall P, Hayhurst, et al. The NHGRI-EBI GWAS catalog: knowledgebase and deposition resource. NAR. 2023;51(D1):D977-985.
Sollis E, Mosaku A, Abid A, Buniello A, Cerezo M, Gil L, Groza T, Güneş O, Hall P, Hayhurst J, et al.., NHGRI-EBI GWAS catalog [Internet]; 2022. Available from: https://www.ebi.ac.uk/gwas/.
Wang F, Bai X, Wang Y, Jiang Y, Ai B, Zhang Y, Liu Y, Xu M, Wang Q, Han X, et al. ATACdb: a comprehensive human chromatin accessibility database. Nucleic Acids Res. 2021;49(D1):D55–64.
Wang F, Bai X, Wang Y, Jiang Y, Ai B, Zhang Y, Liu Y, Xu M, Wang Q, Han X, et al., ATACdb [Internet]; 2020. Available from: https://bio.liclab.net/ATACdb/.
Kolmykov S, Yevshin I, Kuulyashov M, Sharipov R, Kondrakhin Y, Makeev VJ, Kulakovskiy IV, Kel A, Kolpakov F. GTRD: an integrated view of transcription regulation. NAR. 2021;49(D1):D104–11.
Kolmykov S, Yevshin I, Kuulyashov M, Sharipov R, Kondrakhin Y, Makeev VJ, Kulakovskiy IV, Kel A, Kolpakov F. GTRD [Internet]; 2021. Available from: https://gtrd.biouml.org.
Biddie SC, John S, Sabo PJ, Thurman RE, Johnson TA, Schiltz RL, Miranda TB, Sung MH, Trump S, Lightman SL, et al. Transcription factor AP1 potentiates chromatin accessibility and glucocorticoid receptor binding. Mol Cell. 2011;43(1):145–55. https://doi.org/10.1016/j.molcel.2011.06.016. (PMID: 21726817).
Meuleman W, Muratov A, Rynes E, Halow J, Lee K, Bates D, Diegel M, Dunn D, Neri F, Teodosiadis A, et al. Index and biological spectrum of human DNase I hypersensitive sites. Nature. 2020;584:244–51. https://doi.org/10.1038/s41586-020-2559-3.
Meuleman W, Muratov A, Rynes E, Halow J, Lee K, Bates D, Diegel M, Dunn D, Neri F, Teodosiadis A, et al Index and biological spectrum of human DNase I hypersensitive sites [Internet]; 2020. Available from: https://www.meuleman.org/research/dhsindex/
Smedley D, Schubach M, Jacobsen JOB, Köhler S, Zemojtel T, Spielmann M, Jäger M, Hochheiser H, Washington NL, McMurry JA, et al. A whole-genome analysis framework for effective identification of pathogenic regulatory variants in Mendelian disease. Am J Hum Genet. 2016;99:595–606. https://doi.org/10.1016/j.ajhg.2016.07.005.
Tanigawa Y, Dyer ES, Bejerano G. WhichTF is functionally important in your open chromatin data? PLoS Comput Biol. 2022;18(8): e1010378. https://doi.org/10.1371/journal.pcbi.1010378.
Friman, ET. PeakPredict. Zenodo. 2024. https://zenodo.org/doi/10.5281/zenodo.12706471.
Sloan CA, Chan ET, Davidson JM, Malladi VS, Strattan JS, Hitz BC, Gabdank I, Narayanan AK, Ho M, Lee BT et al., ENCODE portal [Internet]; 2016. Available from: https://www.encodeproject.org/.
Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP: the NCBI database of genetic variation. NAR. 2001;21(1):308–11.
Quinlan AR and Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 26(6):841–2
Huang M, Wangg Y, Yangg M, Yan J, Yangg H, Zhuangg W, Xu Y, Koeffler HP, Lin DC, Chen X. dbInDel [Internet]; 202. Available from: http://enhancer-indel.cam-su.org.
Yao L, Liang J, Ozer A, Leung AKY, Lis JT, Yu H. PINTS web portal [Internet]; 2022. Available from: https://pints.yulab.org.
Friman, ET. PeakPredict. Github .2024. https://github.com/efriman/PeakPredict.
Rai V, Quang DX, Erdos MR, Cusanovich DA, Daza RM, Narisu N, Zou LS, Didion JP, Guan Y, Shendure J, et al. Single-cell ATAC-Seq in human pancreatic islets and deep learning upscaling of rare cells reveals cell-specific type 2 diabetes regulatory signatures. Mol Metab. 2020;32:109–21. https://doi.org/10.1016/j.molmet.2019.12.006.
Torres JM, Abdalla M, Payne A, Fernandez-Tajes J, Thurner M, Nylander V, Gloyn AL, Mahajan A, McCarthy MI. A multi-omic integrative scheme characterizes tissues of action at loci associated with type 2 diabetes. Am J Hum Genet. 2020;107:1011–28. https://doi.org/10.1016/j.ajhg.2020.10.009.
Breeze CE, Haugen E, Reynolds A, Teschendorff A, van Dongen J, Lan Q, Rothman N, Bourque G, Dunham I, Beck S, et al. Integrative analysis of 3604 GWAS reveals multiple novel cell type-specific regulatory associations. Genome Biol. 2022;23:13. https://doi.org/10.1186/s13059-021-02560-3.
Garrett-Sinha LA. Review of Ets1 structure, function, and roles in immunity. Cell Mol Life Sci. 2013;70:3375–90. https://doi.org/10.1007/s00018-012-1243-7.
Schwartz AM, Demin DE, Vorontsov IE, Kasyanov AS, Putlyaeva LV, Tatosyan KA, Kulakovskiy IV, Kuprash DV. Multiple single nucleotide polymorphisms in the first intron of the IL2RA gene affect transcription factor binding and enhancer activity. Gene. 2017;603:50–6.
Johansson Å, Rask-Andersen M, Karlsson T, Ek WE. (2019) Genome-wide association analysis of 350 000 Caucasians from the UK Biobank identifies novel loci for asthma, hay fever and eczema Hum Mol Genet. 28:4022–4041. https://doi.org/10.1093/hmg/ddz175.
Portelli MA, Dijk FN, Ketelaar ME, Shrine N, Hankinson J, Bhaker S, Grotenboer NS, Obeidat M, Henry AP, Billington CK, et al. Phenotypic and functional translation of IL1RL1 locus polymorphisms in lung tissue and asthmatic airway epithelium. JCI Insight. 2020;5: e132446. https://doi.org/10.1172/jci.insight.132446.
Daya M, Rafaels N, Burnetti TM, Chhavan S, Levin AM, Shetty A, Gignoux CR, Boorgula MP, Wojcik G, et al. Association study in African-admixed populations across the Americas recapitulates asthma risk loci in non-African populations. Nat Commun. 2019;10(1):880. https://doi.org/10.1038/s41467-019-08469-7.
Trajanoska K, Bherer C, Taliun D, Zhouu S, Richards JB, Mooser V. From target discovery to clinical drug development with human genetics. Nature. 2023;620(7975):737–45.
Dong S, Zhao N, Spragins E, Kagda MS, Li M, Assis P, Jolanki O, Luo Y, Cherry JK, Boyle AP et al. Annotating and prioritizing human non-coding variants with RegulomeDB v.2 Nat Genet. 20223;55(5):724–726
Downes DJ, Cross AR, Hua P, Roberts N, Schwessinger R, Cutler AJ, Munis AM, Brown J, Mielczarek O, de Andrea CE, et al. Identification of LZTFL1 as a candidate effector gene at a COVID-19 risk locus. Nat Genet. 2021;53(11):1606–15.
Wu H, Nord AS, Akiyama JA, Shoukry M, Afzal V, Rubin EM, et al. Tissue-specific RNA expression marks distant-acting developmental enhancers. PLoS Genet. 2014;10(9):e1004610. https://doi.org/10.1371/journal.pgen.1004610.
Lee SA, Kristjánsdóttir K, Kwak H. eRNA co-expression network uncovers TF dependency and convergent cooperativity. Sci Rep. 2023;13(1):19085.
Stefan K, Barksi A. Cis-regulatory atlas of primary human CD4+ T cells. BMC Genomics. 2023;24(1):253. https://doi.org/10.1186/s12864-023-09288-3.
Vockley CM, D’Ippolito AM, McDowell IC, Majoros WH, Safi A, Song L, Crawford GE, Reddy TE. Direct GR binding sites potentiate clusters of TF binding across the human genome. Cell. 2016;166(5):1269–81.
Mercer TR, Edwards SL, Clark MB, Nephh SJ, Wang H, Stergachis AB, John S, Sandstrom R, Li G, Sandhu KS, et al. DNase I-hypersensitive exons colocalize with promoters and distal regulatory elements. Nat Genet. 2013;45(8):852–9.
Partridge EC, Chhetri SB, Prokop JW, Ramaker RC, Jansen CS, Goh ST, Machiewicz M, Newberry KM, Brandsmeier LA, Meadows SK, et al. Occupancy maps of 208 chromatin-associated proteins in one human cell type. Nature. 2020;583(7818):720–8.
Sung MH, Guertin MJ, Baek S, Hager GL. DNase footprint signatures are dictated by factor dynamics and DNA sequence. Mol Cell. 2014;56(2):275–85.
Oh KS, Ha J, Baek S, Sung MH. XL-DNase-seq: improved footprinting of dynamic transcription factors. Epigenetics Chromatin. 2019;12(1):30. https://doi.org/10.1186/s13072-019-0277-6.
Calviello AK, Hirsekorn A, Wurmus R, Yusuf D, Ohler U. Reproducible inference of transcription factor footprints in ATAC-seq and DNase-seq datasets using protocol-specific bias modelling. Genome Biol. 2019;20(1):42.
Kumasaka N, Knights AJ, Gaffney DJ. Fine-mapping cellular QTLs with RASQUAL and ATAC-seq. Nat Genet. 2016;48(2):206–13.
Huang M, Wangg Y, Yangg M, Yan J, Yangg H, Zhuangg W, Xu Y, Koeffler HP, Lin DC, Chen X. dbInDel: a database of enhancer-associated insertion and deletion variants by analysis of H3K27ac ChIP-Seq. Bioinformatics. 2020;36(5):1649–51.
Abramov S, Boystov A, Bykova D, Penzar DD, Yevshin I, Kolmykov SK, Fridman MV, Favorov AV, Vorontsov IE, Baulin E, Kolpakov FF, Makeev V, Kulakovskiy IV. ADASTRA [Internet]; 2021. Available from: https://adastra.autosome.org/bill-cipher/downloads?releaseName=Zanthar.
Hinrichs AS, Karolchik D, Baertsch R, Barber GP, Bejerano G, Clawson H, Diekhans M, Furey TS, Harte RA, Hsu F et al. The UCSC Genome Browser Database: update 2006. NAR 2006;34(D):D590–8
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J. Scikit-learn: machine learning in Python. J Machine Learn Res. 2011;12:2825–30.
Lundberg SM, Lee SI. A unified approach to interpreting model predictions. Adv Neural Inform Process Syst. 2017;30. https://doi.org/10.48550/arXiv.1705.07874.
Steinhaus R, Robinson PN, Seelow D. FABIAN-variant: predicting the effects of DNA variants on transcription factor binding. NAR. 2022;50(W1):W322-329.
Huang L, Rosen JD, Sun Q, Chen J, Wheeler MM, Zhou Y, Min Y, Kooperberg C, Conomos MP, Stilp AM, et al. TOP-LD: a tool to explore linkage disequilibrium with TOPMed whole genome sequence data. Am J Human Genet. 2022;109(6):1175–81. https://doi.org/10.1016/j.ajhg.2022.04.006.
Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C, Singh H, Glass CK. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell. 2010;38(4):576–89.
Review Commons report 1. Early evidence base. 2024. https://doi.org/10.15252/rc.2024235645.
Review Commons report 2. Early evidence base. 2024. https://doi.org/10.15252/rc.2024765375.
Biddie SC. FINDER, GitHub. 2024. https://github.com/sbiddie/FINDER/tree/v1.0.
Biddie SC. FINDER, Zenodo.2024. https://zenodo.org/doi/10.5281/zenodo.12795448.
Review history
This article was first peer-reviewed at Review Commons. The reviewer reports are available online [75, 76]. The rest of the review history containing the authors’ responses and additional reviewer comments are available as Additional file 2.
Peer review information
Wenjing She was the primary editor of this article at Genome Biology and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Funding
SCB is funded by Chief Science Office and NHS Education Scotland (PCL/20/02). ETF was supported by a fellowship from the Swiss National Science Foundation (P500PB_206805). GW is supported by an MRC PhD studentship and WAB are supported by an MRC University Unit grant (MC_UU_000035/7).
Author information
Authors and Affiliations
Contributions
SCB and WAB conceived the study. SCB, GW, EFH, and ETF performed data analyses. SCB wrote the manuscript. All authors were involved in the preparation and editing of the manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
All datasets analyzed have been previously published and undergone ethical approval where appropriate. No ethical approval was required for this study.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Biddie, S.C., Weykopf, G., Hird, E.F. et al. DNA-binding factor footprints and enhancer RNAs identify functional non-coding genetic variants. Genome Biol 25, 208 (2024). https://doi.org/10.1186/s13059-024-03352-1
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s13059-024-03352-1