IMAGE: high-powered detection of genetic effects on DNA methylation using integrated methylation QTL mapping and allele-specific analysis

Identifying genetic variants that are associated with methylation variation—an analysis commonly referred to as methylation quantitative trait locus (mQTL) mapping—is important for understanding the epigenetic mechanisms underlying genotype-trait associations. Here, we develop a statistical method, IMAGE, for mQTL mapping in sequencing-based methylation studies. IMAGE properly accounts for the count nature of bisulfite sequencing data and incorporates allele-specific methylation patterns from heterozygous individuals to enable more powerful mQTL discovery. We compare IMAGE with existing approaches through extensive simulation. We also apply IMAGE to analyze two bisulfite sequencing studies, in which IMAGE identifies more mQTL than existing approaches.

Most mQTL mapping studies thus far rely on DNA methylation data generated using array-based platforms [36][37][38]. However, the falling cost of sequencing and the development of high-throughput sequencing-based approaches to measure DNA methylation levels makes mQTL mapping using sequencing data increasingly feasible. Sequencing-based approaches offer several advantages. They can extend the breadth of DNA methylation analysis to the full genome (e.g., via whole genome bisulfite sequencing [39]), increase the flexibility to target specific regions of interest (e.g., via capture methods [40]), improve the representation of genomic regions or regulatory elements that are poorly represented on current array platforms (e.g., via reduced representation bisulfite sequencing [41,42]), and distinguish 5-hmc modifications from 5-mc modifications (e.g., via TETassisted pyridine borane sequencing [43] or TAB-seq approaches [44]). Further, unlike arrays, which are largely limited to studies in humans, sequencing-based approaches can be applied to any species [45][46][47][48]. Therefore, sequencing-based approaches have become the workhorse of major initiatives like the 1001 Genomes Project in the plant model system Arabidopsis thaliana [49,50]. Importantly, sequencing techniques also facilitate the estimation of allele-specific methylation levels, which should greatly improve the power of mQTL mapping approaches (as allele-specific expression estimates have been shown to do for eQTL mapping: [51,52]). Early attempts to perform mQTL mapping with bisulfite sequencing data have yielded promising results [35,49,53]. However, existing mQTL mapping methods are designed with array data in mind [37,38]. To maximize power, mQTL mapping using sequencing data requires new statistical method development that can properly account for two of its distinctive features.
First, methylation data collected in sequencing studies are counts, not continuous representations like those produced by arrays. Specifically, methylation-level estimates at a given cytosine base are based on both the total read count at the site and the subset of those reads that are unconverted by sodium bisulfite (or other processes [43]). Previous mQTL studies have dealt with these data by first computing a ratio between the methylated count and the total count, and then treating this ratio as an estimate of the true methylation level [35,49]. However, the count nature of the raw data means that the mean and variance of the computed ratio are highly interdependent. This relationship is not captured by previously deployed linear regression methods, which likely leads to loss of power. Indeed, similar losses of power are well documented for differential methylation analysis [40] and differential expression analysis of RNA-seq data [54][55][56][57]. To overcome this challenge, statistical methods for sequencing-based differential methylation analysis now adapt over-dispersed count models, including beta-binomial models [58][59][60][61][62] and binomial mixed models [40,63,64], to properly model the mean-variance relationship and potential overdispersion. In differential methylation analysis, these approaches can substantially improve power compared with normalization-based approaches [30,65,66]. Because mQTL mapping is conceptually similar and can be effectively viewed as genotype-based differential methylation analysis, extending over-dispersed binomial models to mQTL mapping is a promising approach.
Second, sequencing-based techniques are capable of measuring DNA methylation levels in heterozygotes in an allele-specific fashion (i.e., allele-specific methylation, ASM). When ASM estimates support differences in methylation levels between the two alleles carried by heterozygotes, they can be used to increase the power of mapping analysis. Indeed, assuming that additive genetic effects dominate, true cis-acting genetic differences in DNA methylation are expected to lead to both (i) differential methylation by genotype across all three genotypes at a biallelic site and (ii) ASM in heterozygotes. These two types of evidence are only available in sequencing studies, since ASM is not generally detectable when DNA methylation is profiled using arrays. Notably, previous methods for detecting genotype-dependent ASM suggest that it is common across tissue types and species, is more often explained by cis-acting variants than trans-effects, and is enriched near genes that also display patterns of allele-specific expression [67][68][69][70][71][72][73][74][75]. Thus, integrating ASM analysis into mQTL mapping analyses should also contribute to understanding the basis of cisregulatory effects on gene expression. There is strong precedent for such a combined strategy in other omics studies. For example, the methods implemented in Tre-CASE and WASP can integrate allele-specific expression information to greatly enhance the power of eQTL mapping [51,[76][77][78], and the software RASQUAL integrates allele-specific patterns with individual-level differences to facilitate QTL mapping of chromatin accessibility and ChIP-seq data [79]. However, to our knowledge, no method currently exists for integrating ASM with mQTL mapping in sequencing-based studies of DNA methylation.
Here, we develop a new statistical method for mQTL mapping in bisulfite sequencing studies that both accounts for the count-based nature of the data and takes advantage of ASM analysis to improve power. We refer to our method as IMAGE (Integrative Methylation Association with GEnotypes), which is implemented as an open-source R package (www.xzlab.org/software. html). IMAGE jointly accounts for both allele-specific methylation information from heterozygous individuals and non-allele-specific methylation information across all individuals, enabling powerful ASMassisted mQTL mapping. In addition, IMAGE relies on an over-dispersed binomial mixed model to directly model count data, which naturally accounts for sample non-independence resulting from individual relatedness, population stratification, or batch effects that are commonly observed in sequencing studies [40,57]. We develop a penalized quasi-likelihood (PQL) approximation-based algorithm [64,80,81] to facilitate scalable model inference. We illustrate the effectiveness of IMAGE and compare it with existing approaches in simulations. We also apply IMAGE to map mQTLs in two bisulfite sequencing studies from wild baboons and wild wolves.

Results
Method overview and simulation design IMAGE is described in detail in the "Materials and methods" section, with additional information provided in Additional file 1: Supplementary Text. Briefly, IMAGE combines the benefits of both standard mQTL mapping and ASM analysis by jointly modeling non-allele-specific (i.e., per-individual) methylation information across all individuals together with allele-specific methylation information (i.e., per-allele) from heterozygous individuals. This approach enables cis-mQTL mapping when the heterozygous SNP and the CpG site of interest are captured either on the same sequencing read or with known phasing information ( Fig. 1). By combining both allele-specific and nonallele-specific information, IMAGE improves power over traditional mapping approaches that use non-allele-specific information alone. In addition, IMAGE relies on a binomial mixed model to directly model count data from bisulfite sequencing and naturally accounts for over-dispersion as well as sample non-independence. IMAGE uses a penalized quasi-likelihood-based algorithm for scalable inference and is implemented in an open-source R package, freely available at http://www.xzlab.org/software.html.
We performed simulations to examine the effectiveness of IMAGE and compare it with other approaches. In each simulation, we started with real genotypes for n= 50-150 individuals [82] and examined power and accuracy over a range of parameters: the background heritability h 2 , the over-dispersion variance σ 2 , the SNP minor allele frequency MAF, the expected per-site total read TR across individuals, the average methylation ratio π 0 , the SNP effect size PVE, the sample size n, and the proportion of total environmental variance that is shared between two alleles ρ (a detailed explanation of these parameters is available in the "Materials and methods" section). In the simulations, we examined the role of each of these eight modeling parameters in determining mQTL mapping power. To do so, we first created a baseline simulation scenario where we set the simulation parameters to typical values inferred from real data [40] ("Materials and methods" section). Afterwards, we changed one parameter at a time to create different simulation scenarios and examined the influence of each parameter on method performance. In each scenario, we simulated 10,000 SNP-CpG pairs. For 9000 pairs, the methylation level at the CpG site was independent of the SNP genotype, while for the remaining 1000 pairs, CpG site methylation was associated with the SNP genotype, such that genotype explained a fixed proportion of methylation levels equivalent to the parameter PVE. After simulation, we discarded the methylation measurements for CpG sites on non-informative individuals (i.e., those with total read counts of zero). We then applied IMAGE and five other approaches to analyze each SNP-CpG pair separately.
The five other approaches perform mQTL mapping using different information: (1) IMAGE-I, a special case of IMAGE, which uses only non-allele-specific, individual-level information across all individuals; (2) IMAGE-A, another special case of IMAGE, which uses only allele-specific information from heterozygous individuals; (3) MACAU [40,57], which uses a binomial mixed model to perform mQTL mapping using only non-allele-specific information; (4) GEMMA [83][84][85], which uses a linear mixed model to perform mQTL mapping using only non-allele-specific information; and (5) BB, which implements a beta-binomial model [40] to perform mQTL mapping using only non-allele-specific information. Note that, with the exception of IMAGE and IMAGE-A, all methods perform mQTL mapping using only non-allele-specific information. In addition, with the sole exception of GEMMA, all methods model counts directly. For GEMMA, we used normalized data in the form of M values for analysis, following the previous literature [40,57]. We performed 10 simulation replicates (each consisting of 10,000 SNP-CpG pairs) for each scenario and computed power based on a known false discovery rate (FDR) for each scenario by combining simulation replicates.

Simulation results
Overall, the simulation results show that IMAGE outperforms all other methods across all tested parameters ( Fig. 2 and Additional file 2: Figure S1). For example, in the baseline simulation scenario, at an FDR of 0.05, IMAGE reaches a power of 57.15% in a sample size of 100 individuals. IMAGE-I, IMAGE-A, MACAU, GEMMA, and BB reach a power of 7.55%, 10.27%, 7.49%, 2.25%, and 6.79%, respectively. The ranking of different methods is not sensitive to different FDR cutoffs. For example, at an FDR of 0.1, the power of IMAGE is 68.78%, while the power of IMAGE-I, IMAGE-A, MACAU, GEMMA, and BB is 14.98%, 24.35%, 13.64%, 2.84%, and 15.03%, respectively. The superior performance of IMAGE suggests that incorporating ASM information into mQTL mapping can greatly enhance power.
Among the eight parameters we examined, six have similar effects on power across IMAGE and the five other models we compared. For example, the power of all methods increases with larger sample size n (Additional file 2: Figure S1A), larger genetic effect size PVE (Additional file 2: Figure S1B), larger minor allele frequency MAF (Additional file 2: Figure S1C), larger read depth TR (Additional file 2: Figure S1D), and larger over-dispersion variance σ 2 , which implicitly increases the genetic effect size PVE (Additional file 2: Figure S1E). In addition, the power of all methods is the highest for CpG sites with intermediate methylation level π 0 , but reduced for both hypomethylated and hypermethylated sites (Additional file 2: Figure S1F). The power dependence on π 0 is presumably because higher methylation variance in the middle range of π 0 leads to higher power.
Careful examination of the relative performance of different methods in different scenarios yields additional insights. First, among the mQTL mapping methods, we found that count-based approaches (IMAGE-I, MACAU, BB) often outperform a normalized data-based approach (GEMMA). Such performance differences become more apparent when sample size n is small (Additional file 2: Figure S1A), methylation level π 0 is either low or high (Additional file 2: Figure S1F), or mean per-site read depth TR is low (Additional file 2: Figure S1D). For example, when the mean total read TR = 10, the power of IMAGE-I, MACAU, and BB is 5.8%, 4.56%, and 5.33%, respectively (n = 100), while the power of GEMMA is only 1.01%. When TR increases to 30, the power of IMAGE-I, MACAU, and BB becomes 15.25%, 15.32%, and 14.55%, respectively, while the power of GEMMA remains low, at 6.14%. The superior performance of count-based methods is consistent with previous Fig. 1 Schematic of ASM-assisted mQTL mapping. The top three panels show bisulfite sequencing data mapped to a CpG site where methylation level is associated with a nearby SNP, in an AA homozygote (left), an AT heterozygote (middle), and a TT homozygote (right). Note that, while illustrated in the panels, the allele-level methylation information in the two homozygotes is not observed. The bottom three panels depict three methods to detect SNP-CpG association: the standard mQTL mapping approach (left) uses non-allele-specific information from all three individuals to detect an association, the standard ASM analysis (middle) uses allele-level information from the heterozygote only, and the joint analysis approach (right) presented here uses both types of information to achieve a gain in power. mQTL methylation quantitative trait loci, ASM allele-specific methylation observations [40,57], suggesting that modeling sequencing data in the original count form has added benefits for mQTL mapping. For DNA methylation levels, this advantage may arise in part because uncertainty in DNA methylation-level estimates is more accurately modeled in the count data than in normalized ratios. For example, a methylation level of one (completely hypermethylated) is strongly supported for a site-sample combination where read depth is very high, but weakly supported for combinations where read depth is low. The count-based methods effectively capture this distinction, which is lost in conversion to a single ratio.
Second, ASM-based approaches (IMAGE and IMAGE-A) often outperform mQTL mapping approaches that only use non-allele-specific data. This result holds even for IMAGE-A, even though it only models data for heterozygotes at nearby SNPs (and hence, uses only a subset of the data: 42% of the full set of simulated individuals on average). The generally higher power of ASM analysis likely stems from the fact that ASM methods control for both environmental and trans-acting genetic background effects (for each heterozygote, both alleles reside in the same individual, providing a natural internal control). Our simulations suggest that there are two important parameters that influence the relative power of ASM analysis and mQTL mapping. The first important parameter is background heritability, h 2 . Increased background heritability can reduce the performance of mQTL mapping methods, as increased confounding from polygenic effects of other SNPs likely increases the difficulty of identifying individual SNP associations [40,57]. For example, when h 2 = 0, the power of IMAGE-I, MACAU, GEMMA, and BB is 13.57%, 11.62%, 2.69%, and 13.88%, respectively. When h 2 increases to 0.6, however, the power of IMAGE-I, MACAU, GEMMA, and BB reduces to 6.48%, 7.05%, 1.50%, and 5.92%, respectively. In contrast, ASM analysis relies on a model that explicitly accounts for the heritable component that arises from genetic background effects, and thus achieves relatively stable performance. For example, when h 2 = 0, the power of IMAGE and IMAGE-A is 57.48% and 10.30%, respectively. When h 2 increases to 0.6, the power of IMAGE and IMAGE-A actually increases to 63.07% and 23.09%, respectively. This observation is consistent with the fact that the two alleles modeled in ASM, for each individual, share an identical genetic background that becomes easier to control for as its contribution to DNA methylation increases (i.e., as h 2 increases). Thus, IMAGE-I outperforms IMAGE-A when background heritability is zero (h 2 = 0), but performs worse when background heritability is moderate or high (h 2 = 0.3 or 0.6; Fig. 2a).
The second important parameter is the ratio parameter ρ, which represents the relative contribution of shared/ common environmental effects (i.e., the "trans" acting environment) and also influences the relative power of ASM vs mQTL. For mQTL methods, increasing ρ necessarily increases the contribution of common environmental noise shared between the two alleles. Common environmental noise is not explicitly accounted for by mQTL models, thus leading to a reduction in power. For example, when ρ = 0, IMAGE-I, MACAU, GEMMA, and BB detect 7.55%, 7.49%, 2.25%, and 6.79% of true effects, respectively. When ρ increases to 0.9, the power of IMAGE-I, MACAU, GEMMA, and BB reduces to 3.50%, 3.44%, 1.67%, and Power is measured by number of true mQTL detected at a false discovery rate (FDR) of 0.05. Each simulation setting is based on 10 simulation replicates, each including 10,000 simulated SNP-CpG pairs, 10% of which represent true mQTL. a We vary h 2 , the background heritability, to be either 0, 0.3, or 0.6, while maintaining other parameters at baseline. b We vary ρ, the proportion of common environmental variance, to be either 0, 0.3, or 0.9, while maintaining other parameters at baseline. The middle panel in a and the left panel in b correspond to the baseline simulation setting. Increasing both h 2 and ρ, which capture genetic and common environmental background effects, respectively, results in increased power for methods that use ASM information (IMAGE and IMAGE-A), but losses in power for methods that do not use ASM information (IMAGE-I, MACAU, GEMMA, BB). FDR false discovery rate 3.57%, respectively. In contrast, ASM analysis explicitly accounts for both common and independent environmental background effects, again because it measures DNA methylation in the two alleles in the same individual. ASM methods thus achieve better, not worse, performance with higher values of ρ. For example, when ρ = 0, the power of IMAGE and IMAGE-A is 57.15% and 10.27%, respectively. When ρ increases to 0.9, the power of IMAGE and IMAGE-A becomes 84.15% and 67.55%, respectively. Consequently, while mQTL methods have similar power as ASM when ρ is small, ASM can outperform mQTL when ρ is large (Fig. 2b).
Finally, we note that while we set PVE = 0.10 and h 2 = 0.30 in the baseline simulations to capture realistic effect sizes and background heritability across all SNP-CpG pairs genome-wide, reasonable data filtering decisions will often increase mean PVE and h 2 among SNP-CpG pairs tested in real data applications. For example, in the wolf and baboon data sets analyzed below, the median PVE was approximately 0.15 and the median h 2 estimate was near 0.5. For direct comparability, we therefore also created a simulation scenario in which we set PVE to 0.15 and h 2 to 0.50 (Additional file 2: Figure S1G). Notably, the relative power of different methods in this setting largely recapitulates our observations in the real data applications (see below).

mQTL mapping in wild baboons
We applied our method to analyze a reduced representation bisulfite sequencing data collected on 67 baboons from the Amboseli ecosystem of Kenya [40,45]. Detailed data description and processing steps are provided in the "Materials and methods" section, with an illustrative processing diagram shown in Additional file 2: Figure S3. Briefly, we extracted 49,196 SNP-CpG pairs from the bisulfite sequencing data, which consists of 13,753 unique SNPs and 45,210 unique CpG sites. We applied IMAGE together with the other five approaches described above to analyze each SNP-CpG pair individually. We performed permutations to estimate FDR for each method, and we report results based on a fixed FDR cutoff. Consistent with our simulations, our method achieves higher power compared with other methods in the baboon data set (Fig. 3a). For example, at an empirical FDR of 5%, IMAGE detected 7043 associated SNP-CpG pairs, which is 45% more than that detected by the next best method (IMAGE-A, which detected 4855 pairs at a 5% FDR). IMAGE-I, MACAU, GEMMA, and BB detected 3585, 3024, 2629 and 3259 pairs, respectively. Also consistent with the simulations, the higher power of IMAGE compared to other methods is robust with respect to different FDR cutoffs (Fig. 3a). We illustrate a few example sites that were only detected by IMAGE in Additional file 2: Figure S4. For these sites, methylation levels measured in the heterozygotes are noisy and often indistinguishable from at least one type of homozygote (often because total read counts are unevenly distributed across alleles). However, by separating methylation levels in heterozygotes into the contribution from each individual allele and modeling ASM information together with non-allele-specific information, IMAGE remains capable of identifying mQTLs in these sites. In addition, consistent with simulations, we also observed that our method could detect more associated SNP-CpG pairs with increasing MAF (Additional file 2: Figure S5A), increasing read depth TR (Additional file 2: Figure S5B), increasing sample size (Additional file 2: Figure S5C), or at intermediate methylation levels (Additional file 2: Figure S5D).
To validate the mQTLs we identified, we randomly split the sample into two approximately equal-sized subsets (one with 34 individuals and the other with 33 individuals) and examined the consistency of the SNP-CpG pairs detected in the two subsets. We removed IMAGE-A from this analysis as it requires at least five heterozygous individuals, which is no longer satisfied for many SNP-CpG pairs in each of the two subsets. For the remaining methods, we found that IMAGE detects more consistent SNP-CpG pairs between the two subsets than the other approaches (Fig. 3b). For example, among the top 5% (n = 2511) associated SNP-CpG pairs based on IMAGE, 53.8% of them were identified in both subsets. In contrast, among the top 5% (n = 2511) associated SNP-CpG pairs based on IMAGE-I, MACAU, GEMMA, and BB, 35.84%, 35.12%, 33.92%, and 37.64% overlapped between the two subsets. The greater consistency of results from IMAGE thus provides convergent support for its increased power.
Next, we assessed the set of detected SNP-CpG associations by performing functional enrichment analysis to compare our findings against published results (Fig. 3c). Here, we refer to the CpG sites with associated mQTL as mCpG sites. We examined whether the set of mCpG sites were enriched in CpG islands, CpG island shores, CpG island shelves, or genomic "open sea." To do so, we obtained functional genomic annotation information from the UCSC Genome Browser for the baboon genome, Panu2.0, and relied on the same criterion as [86] to annotate genomic regions (details in the "Materials and methods" section). For each annotated category, we then computed the proportion of mCpG sites in the annotated regions and contrasted it to the proportion of non-mCpG sites analyzed in our original mQTL mapping analysis. We found that mCpG sites are significantly enriched in open seas compared to non-mCpG sites (69.74% vs 66.08%; Fisher's exact test, p value = 0.0106) but underrepresented in CpG islands (11.16% vs 14.33%; p value = 1.056 × 10 −9 ). The results are consistent with previous observations [87,88], partly because CpG islands are often enriched in evolutionarily conserved promoter regions [89][90][91] that harbor fewer regulatory genetic variants and partly because power to detect mQTL is lower in hypomethylated regions [92]. The results are qualitatively consistent across sites with different mean CpG methylation levels, although do not reach statistical significance in all bins likely due to the smaller number of sites and the resulting lower power in each bin (Additional file 2: Figure S6). Importantly, despite the higher number of mCpG sites detected by IMAGE, the evidence for both enrichment in open sea and underrepresentation in CpG islands is also stronger in the IMAGE analysis than for other methods (Additional file 3: Table S1).
Finally, we counted the percentage of SNP-CpG pairs for which the SNP directly resides in the CpG sequence, abolishing the CpG site and therefore resulting in an entirely unmethylated alternate allele [69,93].  3 mQTL mapping results in the baboon RRBS data. a IMAGE identified more mQTL than the other five methods across a range of empirical FDR thresholds. b IMAGE identifies more consistent associations than the other methods in the subset analysis. Here, we randomly split individuals into two approximately equal-sized subsets and analyzed the two subsets separately using each method. We then counted the number of overlapping mQTL identified in both subsets. The overlap ratio (y-axis) is plotted against the percentage of top mQTL ranked by statistical evidence for a SNP-CpG methylation association in each method (x-axis). c Upper panel: log 2 odds ratio of detecting associated SNP-CpG pairs, together with the 95% CI, is computed for CpG sites residing in different annotated genomic regions. CpG sites with IMAGE-identified mQTL are enriched in open sea regions (p value = 0.0106) and depleted in CpG islands (p value = 1.056 × 10 −9 ). Bottom panel: all analyzed CpG sites were annotated to genomic regions based on their relation to the nearest CpG island. CpG islands were annotated based on the UCSC Genome Browser (average length = 672 bp in the data; min = 201 bp; max = 15,960 bp). Shore is the flanking region of CpG islands covering 0-2000 bp distant from the CpG island. Shelf is the region flanking island shores covering 2000-4000 bp distant from the CpG island. d A higher percentage of CpG sites are directly disrupted by the SNP in mQTL pairs compared to by chance alone (horizontal dashed line), and more so than in non-mQTL pairs (p value < 2.2 × 10 −16 ). Such enrichment decays with increased FDR thresholds. *p < 0.05, **p < 0.01 These sites, by definition, should exhibit mQTL and ASM. Four hundred three sites in our data set were disrupted by SNPs, and 59.6% of them (n = 240) were indeed identified as significant mCpG sites. For 95.70% of those we did not detect (n = 156), the non-disrupted CpG was also hypomethylated in our sample (< 10% methylation level), which would make it impossible to detect an mQTL (i.e., because both disrupted and nondisrupted alleles are hypomethylated). CpG sites disrupted by SNPs accounted for 3.72% of significant mCpG sites (compared to the 0.89% expected by chance), but only 0.43% of non-mCpG sites, in support of the accuracy of our mQTL mapping approach (Fisher's exact test p value < 2.2 × 10 −16 ). In addition, as expected, the percentage of significant mCpG sites accounted for by CpG sites disrupted by SNPs gradually decreases with less stringent FDR cutoffs (Fig. 3d). Importantly, IMAGE also outperforms the other five methods on this metric (Additional file 3: Table S2).

mQTL analysis in wild wolves
Finally, we applied IMAGE to analyze a second RRBS data set collected on 63 gray wolves from Yellowstone National Park [46,94]. We applied the same data processing procedure described above for baboons, followed by mQTL mapping. In total, we extracted 279,223 SNP-CpG pairs from the bisulfite sequencing data, which consists of 77,039 unique SNPs and 242,784 unique CpG sites. IMAGE again achieved higher power compared with the other methods (Fig. 4a). At an empirical FDR of 5%, IMAGE detected 34, 779 significantly associated SNP-CpG pairs, which is 50% more than that detected by the next best method (IMAGE-A), and 262% more than the other four methods (Fig. 4a and Additional file 2: Figure S7). As in the baboons, subset analysis confirmed that IMAGE detects more consistent SNP-CpG pairs than the other approaches (Fig. 4b). For example, among the top 5% (n = 14,091) associated SNP-CpG pairs based on IMAGE analysis, 53.8% of them are consistent between the two subsets, compared to 20.5-30.7% for the other four methods tested. Consistent with results from simulations and the baboon data, we also observed that our method could detect more associated SNP-CpG pairs with intermediate methylation levels, increasing MAF, increasing read depth, and increasing sample size (Additional file 2: Figure S5).
Finally, consistent with the baboon results, mCpG sites in the wolves were significantly enriched in open sea compared to non-mCpG sites (31.77% vs 26.31%; p value <2.2 × 10 −16 ) and were underrepresented in CpG islands (30.17% vs 37.43%; p value < 2.2 × 10 −16 ) (Fig. 4c). In the wolves, we also observed significant (albeit much weaker) enrichment of mCpG sites in shelf regions (12.49% vs 11.63%; p value = 9.001 × 10 −5 ) and shore regions (25.57% vs 24.64%; p value = 5.890 × 10 −3 ). The higher frequency of mCpG sites in CpG island shelves and shores is consistent with previous studies [87,88] and likely reflects greater power to detect enrichment in the wolf data set, which yields a larger number of analyzable SNP-CpG pairs than in the baboons (m = 242,784 in wolf vs m = 45,210 in baboon).
The enrichment in open sea and underrepresentation of mCpG sites in CpG islands are robust regardless of whether we stratify sites based on mean methylation levels, although the shelf/shore results are noisier (Additional file 2: Figure S8). Again, we found that enrichment results were stronger in the IMAGE analysis than when using other methods (Additional file 3: Table S3) and that mCpG sites were more likely to be disrupted by their associated SNPs than non-mCpG sites (3.66% vs 0.18%; p value < 2.2 × 10 −16 ) ( Fig. 4d; see also Additional file 3: Table S4).

Discussion
Here, we present IMAGE, a new statistical method with a scalable computational algorithm, for mQTL mapping in bisulfite sequencing studies. IMAGE relies on a binomial mixed model to account for the count nature of over-dispersed bisulfite sequencing data, models multiple sources of methylation-level variance, and incorporates allele-specific methylation patterns from heterozygous individuals into mQTL mapping. Both simulations and two real data sets support its increased power over other commonly used methods.
A key feature of our method is its ability to incorporate allele-specific methylation information into mQTL mapping. In RNA sequencing studies, it has been well documented that incorporating ASE information can greatly improve the power of eQTL mapping [51,[76][77][78]. Our results confirm that this observation generalizes to mQTL mapping and provides substantial benefits over approaches that cannot or do not use allele-specific data. Notably, these benefits are not limited to the RRBS data we examined here: IMAGE can also be applied to analyze data generated via whole genome bisulfite sequencing (WGBS) [39] or by newer approaches that distinguish 5-hmc modifications from 5-mc modifications [43,44]. Doing so would greatly facilitate detection of methylation-associated genetic variants genome-wide, including variants associated with different types of methylation marks.
Notably, although secondary to the methods advance itself, our real data applications show that mQTL mapping can be successfully executed using bisulfite sequencing data alone, in the absence of independently generated genotype data. Specifically, we used the same bisulfite sequencing data set to both extract methylation measurements and call SNP genotypes. Our approach dovetails with previous observations that accurate genotyping data can be obtained from RNA sequencing data [95], bisulfite sequencing data [78], or ChIP sequencing data [96], which simultaneously reduces experimental cost and increases the utility of different sequencing data types. Because of these benefits, molecular QTL mapping without separate DNA sequencing or genotyping is gaining popularity [97]. For example, a recent study performed eQTL mapping and ASE analysis using RNA sequencing alone and demonstrated that this strategy achieves approximately 50% power compared to traditional eQTL mapping strategies that rely on independently derived genotype data, even though it only uses the 12.66% of SNPs represented in blood-derived RNA-seq reads [45]. Here, we also show that genotyping and phenotyping from the same data set can facilitate well-powered mQTL mapping. Notably, unlike RNA-seq data, because allele-specific methylation information is represented as the ratio between methylated reads and total reads mapped to the same allele, our approach is also less likely to be affected by allele-specific mapping biases (mitigating another argument for generating independent genotype data). Thus, our mQTL mapping approach has the potential to both increase the utility and applicability of functional genomic data types and improve accessibility of this type of analysis across species.
Our method is not without limitations. For example, to enable ASM-assisted mQTL mapping, our method makes a key modeling assumption that the allelic effect size estimated from heterozygotes is equivalent to the , and more so than in non-mQTL pairs (p value < 2.2 × 10 −16 ). Such enrichment decays with increased FDR thresholds. *p < 0.05, **p < 0.01 genotype effect size estimated from mQTL mapping across all genotype classes. This assumption is generally satisfied for cis genetic effects when the SNP is close to the CpG site [98], and is shared, for gene expression phenotypes, with ASE-assisted eQTL mapping methods (e.g., TreCASE and WASP [51,52]). However, in rare occasions, the equal effect size assumption may be violated. For example, if ASM arises because of genomic imprinting instead of sequence variation, the allelic effect size may be much smaller than the mQTL effect size obtained across all individuals. Such a violation would lead IMAGE to lose power relative to classical mQTL mapping approaches. Notably, imprinted regions are quite rare in vertebrate genomes (less than 1% of genes are imprinted) [99]. However, excluding imprinted loci prior to IMAGE mapping or substituting the IMAGE-I approach for these loci may slightly improve performance. Additionally, in unphased data, an important limitation of IMAGE is that it can only be used to analyze adjacent SNP-CpG pairs that are covered by the same sequencing reads. Analyzing only adjacent SNP-CpG pairs can limit the discovery of mQTLs. Therefore, it would be important to extend IMAGE to analyze distant SNP-CpG pairs in unphased data, using, for example, strategies presented in [100]. Certainly, if SNP data can be phased, IMAGE can also be applied to analyze SNP-CpG pairs that are separated by longer distances. In principle, using phased data could improve mQTL mapping power even further, if physically linked CpG sites display consistent ASM. Because the baboon and wolf data we analyzed here are not associated with an extensive genetic reference panel, we did not attempt to extend our analysis to phased data. Nevertheless, exploring the benefits of phased data or extending IMAGE to analyzing distant SNP-CpG pairs in unphased data is an important future direction.
Another limitation of IMAGE is that type I error may not be well controlled when methylation background heritability is high (> 0.6, Additional file 3: Table S5), when the sample size is small (< 100, Additional file 3: Table S6), or when the genotype minor allele frequency is low (< 0.1, Additional file 3: Table S7). As a result, we recommend calibrating the false discovery rate against a permutation-derived empirical null, as we have done here (we note that calibrating against permutations has become an increasingly common approach in functional genomic mapping studies in any case [101,102]). Finally, while our method is reasonably efficient and can be readily applied to analyze hundreds of individuals and tens of thousands of SNP-CpG pairs (Table 1), new algorithms will be needed to adapt IMAGE to data sets that are orders of magnitude larger.
Nevertheless, in its current form, IMAGE is wellsuited to analyzing sequencing-based DNA methylation data sets of the size and scale typically generated in recent studies [103]. Thus, it can be flexibly deployed to investigate the genetic architecture of gene regulatory variation, the relative role of genes and the environment in shaping the epigenome, or the mediating role of DNA methylation in linking environmental conditions to downstream phenotypes, including human disease (e.g., via Mendelian randomization or related approaches [104,105]).

Method overview
Both mQTL mapping and ASM analysis examine one CpG site-SNP pair at a time to identify SNPs associated with DNA methylation levels. However, these two approaches rely on different information to model the genotype-DNA methylation-level relationship. Specifically, mQTL mapping focuses on modeling the methylated read counts and total read counts at the individual level across all samples, without differentiating between the contributions from the two alleles contained within each individual. In contrast, ASM analysis focuses on modeling methylated read counts and total read counts in an allelespecific fashion, restricting it to heterozygotes for the SNP of interest (otherwise, the contributions of each allele cannot be decoupled). mQTL mapping has the benefit of using the entire sample, not just heterozygotes. In contrast, ASM has the benefit of internal control, since both alleles within each heterozygote experience the same genetic and environmental background.
To take advantage of both approaches, IMAGE independently models each CpG-SNP site pair. For each individual measured at a CpG-SNP pair, we denote y i and r i as the methylated read count and total read count for the ith individual (combined across alleles), for i = 1, ⋯, n. We denote the corresponding methylated and total read counts mapped to each of the two alleles of the ith individual as y il and r il , for l = 1 or 2. Thus, y i = y i1 + y i2 and r i = r i1 + r i2 . Note that y il and r il are only observed in heterozygotes, so are treated as missing data in homozygotes (more details below). We then model the methylated read counts for each allele as a function of the total read counts for the same allele using a binomial model: where π il is the true methylation level for the lth allele in the ith individual. We further model the logittransformed methylation proportion π il as a function of allele genotype: where μ is the intercept; x il is the lth allele type for the ith individual for the SNP of interest (x il = 0 or 1, corresponding to the reference allele and alternative allele, respectively); and β is the corresponding allele/genotype effect size. In addition to these fixed effects, we model three random effects to account for different sources of over-dispersion. Specifically, g i represents the genetic background/polygenic effect on DNA methylation for the ith individual and can be used to account for kinship or other population structure in the sample. We assume g ¼ ðg 1 ; ⋯; g n Þ T MVNð0; σ 2 g K Þ, where K is a known n by n genetic relatedness matrix that can be estimated either from genotype or pedigree data. u i represents individual-level environmental effects that we assume are independent across individuals but shared between the two alleles within the same individual. We assume u i Nð0; σ 2 u Þ. Finally, e il represents the residual error and is used to account for independent noise that varies across both individuals and alleles (e.g., stochastic events). We assume e il Nð0; σ 2 e Þ . We standardize the genetic relatedness matrix K to ensure that the mean of the diagonal elements of K equals 1, or trðK Þ n ¼ 1. When this is the case, h 2 ¼ , and can be interpreted as the approximate background heritability of DNA methylation levels (details in Additional file 1: Supplementary  Text). Here, the background heritability represents the proportion of variance in the latent parameter λ explained by the genetic effects from all SNPs other than the SNP of focus (i.e., x). Therefore, the background heritability is the usual heritability minus the genetic effect of x. Our primary goal is to test the null hypothesis that genotype is not associated with methylation levels, or equivalently, H 0 : β = 0. While the above model is fully specified for heterozygous individuals, it is not fully specified in homozygotes, where y il and r il are not observed. For homozygotes, only the sums of the reads across both alleles, y i = y i1 + y i2 and r i = r i1 + r i2 , are observed. Therefore, for homozygotes, we derive a model for y i and r i based on Eq. (1) by summing over all possible values of y il and r il : In Eq. (3), we assume that the model specified in Eq. (1) for the two alleles are independent of each other; thus, P(y i1 , y i2 | r i1 , r i2 , π i1 , π i2 ) = P(y i1 | r i1 , π i1 )P(y i − y i1 | r i − r i1 , π i2 ). We further assume that P(r i1 | r i ) follows a binomial distribution r i1~B in(r i , 0.5), which reflects the assumption that both alleles are equally likely to be represented in the sequencing data. Even with these two assumptions, the probability P(y i | r i , π i1 , π i2 ) in Eq. (3) does not have an analytic form and can only be evaluated numerically, which is highly computationally inefficient for parameter estimation and inference. To enable scalable computation, we therefore approximate the distribution in Eq. (3) using a binomial distribution (details in Additional file 1: Supplementary Text). Numerical simulations demonstrate the accuracy of this approximation across a range of settings (Additional file 2: Fig. S9).
The model defined in Eqs. (1), (2) (for heterozygous individuals), and (3) (for homozygous individuals) allows us to perform ASM-assisted mQTL mapping to identify SNPs associated with DNA methylation levels. Due to the random effects terms in the model, the joint likelihood based on these equations consists of a high-dimensional integration that cannot be solved analytically. Here, we rely on the penalized quasi-likelihood (PQL) algorithm that is commonly used for fitting generalized linear mixed models [64,80,81] to perform parameter estimation. Based on the parameter estimates, we further calculate a Wald statistic for testing the null hypothesis that H 0 : β = 0 and obtaining a corresponding p value.
We refer to the above model as IMAGE, which is implemented as a freely available R software package at www.xzlab.org/software.html.

Simulations
We performed simulations to examine the effectiveness of our method and compare it with other approaches.
To do so, we first randomly selected 150 individuals from the 1958 birth cohort study, which is a part of the control samples that were used in the Wellcome Trust Case Control Consortium Study (WTCCC) [82]. We then obtained genotypes for 394,117 SNPs on chromosome 1 for these selected individuals. In the simulations, we examined the influence of sample size on power by choosing three different sample sizes: n= 50, 100, or 150. For n = 150, we used all 150 samples; for n < 150, we randomly selected the corresponding number of individuals from the 150 samples. For each simulation replicate, we computed the genetic relatedness matrix K from the SNP data using GEMMA [83][84][85]. We examined the influence of SNP minor allele frequency (MAF) on power by dividing the 394,117 SNPs into three different MAF bins: an MAF bin centered on 0.1, which contains SNPs with an MAF between 0.05 and 0.15 (p = 100,631); an MAF bin centered on 0.3, which contains SNPs with an MAF between 0.25 and 0.35 (p = 51,800); and an MAF bin including 0.5 which contains SNPs with an MAF between 0.45 and 0.50 (p = 23,619). To simulate SNP-CpG site pairs, given a combination of sample size and MAF bin, we randomly selected one SNP from the appropriate MAF bin and simulated methylation counts and total read counts based on the following procedure.
For the total read counts, we first used a negative binomial distribution NB(TR, ϕ) to simulate the total read count r i for each individual. Here, TR is the mean parameter and ϕ is the dispersion parameter. We set TR= 10, 20, or 30, close to the median estimate across all CpG sites from the baboon data (details of the data are described in the next section; median estimate in the real data = 23). We set ϕ= 3, which is close to the median estimates obtained from the baboon data (median estimate in the real data = 2.80). To obtain the total read count mapped to each of the two alleles, we further simulated a proportion parameter q i , which represents the proportion of reads mapped to one allele out of the two alleles. Specifically, q i was simulated from a beta distribution Beta(a, b), where we set the shape parameters a and b to both be 10, so that the simulated q i is symmetric around 0.5 and is within the range of (0.3, 0.7) in 93.6% of cases. With r i and q i , we simulated the total read count mapped to one of the two alleles from r i1~B in(r i , q i ) and set the total read count mapped to the other allele as r i2 = r i − r i1 .
For the methylated read counts, we performed simulations using a combination of five parameters. These five parameters include the intercept μ, which characterizes the baseline methylation level (interpretable as the mean methylation level within a given population); h 2 , which represents background heritability; σ 2 , which is the overdispersion variance; ρ, which characterizes the proportion of common environmental variance (i.e., for those effects that are shared between the two alleles in each individual) with respect to both the common environmental variance and the independent environmental variance that is independent between both individuals and alleles within individuals; and PVE, which represents genotype effect size in terms of proportion of phenotypic variance explained (PVE) by genotype. With these four parameters, we first simulated the genetic random effects g = (g 1 , ⋯, g n ) T (an n-vector) across all individuals from a multivariate normal distribution with covariance ð1þρÞh 2 2þðρ−1Þh 2 σ 2 K to guarantee that the background heritability for our population of simulated individuals is h 2 (details in Additional file 1: Supplementary Text). For each individual at a time, we then simulated the environmental random effects (e i1 , e i2 ) and u i together as a bivariate vector (u i + e i1 , u i + e i2 ) T from a bivariate normal distribution with a covariance Σ, where Σ For sites where methylation level was not associated with genotype, the SNP effect β was set to zero and the background genetic effects, environmental effects, and an intercept (μ) were then summed together to yield the latent variable π il through logit(π il ) = logit(π 0 ) + g i + u i + e il for the lth allele in ith individual. For sites with true mQTL, we used logit(π il ) = logit(π 0 ) + x il β + g i + u i + e il to yield the latent variable π il , where x il is the allele genotype for the lth allele in the ith individual. We randomly draw β Nð0; σ 2 b Þ for each CpG site in turn, where σ 2 b is set to ensure that genetic effects explain a fixed PVE in logit(π il ), on average. We set PVE to be 5%, 10%, or 15% to represent different mean mQTL effect sizes, and we derive σ 2 b ¼ PVE σ 2 ð1−PVEÞV ðxÞ , where the function V (•) denotes the sample variance computed across individuals with x being a genotype vector of size n. Finally, we simulated the methylated read counts for each allele based on a binomial distribution with a rate parameter determined by the total read counts r i and the methylation proportion π il ; that is, y il~B in(r il , π il ) for the lth allele in ith individual. For heterozygotes, we retained the allele-level data (y i1 , y i2 ) and (r i1 , r i2 ). For homozygotes, we collapsed the allele-level data into individual-level data, y i = y i1 + y i2 and r i = r i1 + r i2 .
Using the procedure described above, we first simulated data under a baseline simulation scenario of n = 100, h 2 = 0.3, π 0 = 0.5 , MAF = 0.3, ρ = 0, TR = 20, σ 2 = 0.7, and PVE = 0.1 for mQTL sites. We then varied one parameter at a time to generate different simulation scenarios to examine the influence of each parameter, following [40]. Here, we varied the baseline methylation level π 0 to be either 0.1, 0.5, or 0.9 to represent low, moderate, or high levels of DNA methylation. We varied h 2 = 0.0, 0.3, or 0.6 to represent no, medium, or high background heritability. We varied σ 2 = 0.3, 0.5, or 0.7 to represent different levels of over-dispersion. We varied ρ= 0, 0.3, or 0.9 to represent different levels of common environment influence. For each simulated combination of parameters, we performed 10 simulation replicates consisting of 10,000 CpG sites each. Among these sites, DNA methylation levels at 1000 of them were associated with the SNP genotype (β ≠ 0) while DNA methylation levels for the remaining 9000 were not (β = 0).

Baboon RRBS data
We applied our method to a bisulfite sequencing data set from 69 wild baboons from the Amboseli ecosystem in Kenya [40,45]. These data were generated using RRBS on the Illumina HiSeq 2000 platform, with 100-bp single-end sequencing reads. We obtained the raw fastq files from NCBI (accession number PRJNA283632), removed adaptor contamination and low-quality bases using the program Trim Galore (version 0.4.3) [106], and then mapped reads to the baboon reference genome (Panu2.0) using BSseeker2 [107] (Additional file 2: Figure S3; more details in Additional file 1: Supplementary Text). After removing two samples that had extremely low sequencing read depths (57,734 and 58,070 reads, respectively), sequencing read depth ranged from 5.00 to 79.78 million reads (median = 24.48 million reads; sd = 13.69 million).
We performed SNP calling in the bisulfite sequencing data using CGmaptools, a SNP calling program specifically designed for bisulfite sequencing data. CGmaptools examines one individual at a time using the BayesWC SNP calling strategy [78]. Following the authors' recommendations, we used a conservative error rate of 0.01 and a dynamic p value to account for different read depth per site. Further, we modified the source code to make CGmaptools output homozygous reference genotypes as well. After SNP calling, we indexed and merged variant call files (VCFs) using VCFtools [108]. We then obtained a common set of SNPs where the position was called in at least 50% individuals (including homozygous reference calls). For each individual, we filtered out SNPs that were called using less than three reads. For each SNP, we filtered out variants that had an estimated MAF < 0.05. Finally, we filtered out 989 multiallelic SNPs to obtain a final call set of 289,103 analysis-ready SNPs (mean = 203,864 SNPs typed per sample; median = 204, 554; sd = 34,768). We computed the genetic relatedness matrix K in GEMMA, using this SNP data set.
To validate the SNP genotype data, we compared the variants identified from the bisulfite sequencing data to a set of previously identified SNP variants in baboons [109]. These previously identified SNPs were obtained from 44 different wild baboons from East Africa, including members of the baboon population from which the RRBS data were generated but also members of baboon populations outside Amboseli, via low-coverage DNA sequencing (range 0.6× to 4.35×; median = 1.91×; sd = 0.77×). This data set identified a total of 24,770,393 SNPs, with an average of 17,725,780 SNPs genotyped per individual (median = 18,139,340; sd = 4,315,590). Because of the low sequencing depth in the DNA sequencing data set, we expected that variants called from the bisulfite sequencing data would not completely overlap with variants identified from the DNA sequencing data. Indeed, we found that 50.9% of our called variants are located at a known variant from the DNA sequencing study, with the remaining SNPs being novel. Importantly, among overlapping variants, 99.5% have the same alternate allele, in support of the accuracy of SNP calling from bisulfite sequencing data. Additionally, we observe more overlap in called variants with higher alternate allele frequency, reaching 72.5% for variants with an alternate allele frequency > 0.5 in the RRBS data (Additional file 2: Figure  S10A). The allele frequency estimates from the two data sets for overlapping variants are reasonably well correlated (Spearman correlation r = 0.551; p value < 2.2 × 10 −16 ; Additional file 2: Figure S10B).
In addition to genotyping, we used CGmaptools to obtain CpG-SNP pairs where the SNP and CpG site were profiled on the same sequencing read. The distance between the SNP-CpG site pairs ranges from 1 to 104 bp, with a median distance of 37 bp (mean = 39.75 bp; sd = 26.15 bp; Additional file 2: Figure S10C). We extracted the methylation-level estimates for each CpG site in the form of the number of methylated read counts and the number of total read counts, at the individual level for homozygotes and for each allele separately for heterozygotes. We obtained a total of 522,965 SNP-CpG pairs, with 82,217 unique SNPs and 391,137 unique CpG sites. Following [49], we excluded CpG sites (i) that were measured in less than 20 individuals, (ii) where methylation levels fell below 10% or above 90% in at least 90% of measured individuals, (iii) that had a mean read depth less than 5, or (iv) that were paired with a SNP with MAF < 0.05 across individuals for whom DNA methylation estimates were available. To avoid potential mapping bias, we also excluded CpG sites with apparent differences in methylation levels between reference and alternate alleles that were larger than 0.6. Note that excluding these sites is a conservative strategy and may remove truly associated SNP-CpG pairs where mQTL are unusually large effect size. After filtering, our final data consisted of 49,196  To check the quality of DNA methylation estimates for these CpG sites, we examined their distribution across individuals. Similar to other RRBS data sets [110], we observed a bimodal distribution pattern of methylation levels, including a large number of hypomethylated and hypermethylated CpG sites (Additional file 2: Figure S10D). Next, we examined the accuracy of methylation measurements obtained from our pipeline by comparing the mean methylation at each CpG site obtained here to those estimated in a previous study that focused on a subset of 61 individuals but used a different mapping and DNA methylation estimation pipeline [111]. As expected, the overall distribution of DNA methylation levels is almost identical between our pipeline and the previous study for the 15,605 overlapping sites (Additional file 2: Figure S10E). In addition, sitespecific DNA methylation-level estimates are highly correlated (Spearman correlation r = 0.855, p value <2.2 × 10 −16 ; Additional file 2: Figure S10E). Finally, we checked whether our data suggest mapping bias in favor of the reference allele. Among the CpG sites we analyzed, we observed no bias in methylation-level estimates between the reference and the alternate alleles (Additional file 2: Figure S10F).
We applied five different approaches (details in the "Results" section), together with our primary IMAGE method, to analyze the baboon DNA methylation data. Most of these methods are count based, and algorithms for count-based models can be computationally unstable in the presence of covariates. To control for confounding effects from covariates, for each SNP in turn, we removed the effects of age, sex, and the top two methylation principal components based on M values [112] and used the genotype residuals for analysis. One method, IMAGE-A, requires a relatively large number of heterozygous individuals and was thus only applied to analyze sites for which we identified at least 5 heterozygotes (38,250 SNP-CpG pairs). All other methods were applied to all 49,196 SNP-CpG pairs. Because different methods have different type I error control and one method (IMAGE-A) analyzes a different number of SNP-CpG pairs, to ensure fair comparison, we performed permutations to construct empirical null distributions. Specifically, we combined the count data from the heterozygotes (y i1 , y i2 ), (r i1 , r i2 ) with the count data from the homozygotes (y i , r i ), treated the two alleles of each heterozygote as two samples and treated each homozygote as one sample, permuted the sample label 10 times to create null permutations, and applied each method to analyze the permuted data. We note that an alternative permutation strategy would be to permute (y i , r i ) along with covariates across individuals. In this strategy, the number of methylated reads for each allele (out of total reads for each allele) in heterozygotes could then be sampled from a binomial distribution with probability 0.5, conditional on y i and r i − y i respectively. This alternative strategy is not ideal for small sample sizes, but is likely to work well for large samples (approximately n > 150). Therefore, we have also implemented this alternative permutation strategy in the software and recommend users to explore both strategies and select one that performs the best for their data. Regardless of which permutation strategy one uses, the statistics from the permuted data allowed us to construct an empirical null distribution. With the empirical null distribution, we estimated the empirical false discovery rate (FDR) for different methods at different p value thresholds. We then compared the number of associations detected by different methods at a fixed FDR cutoff.
Finally, following [86], we annotated CpG sites into four categories based on genomic locations obtained from the UCSC Genome Browser: island, shore, shelf, and open sea. CpG islands are defined as short (approximately 1 kb) regions of high CpG density in an otherwise CpG-sparse genome [113]. A large proportion of CpG islands have been shown to be associated with gene promoters [114,115]. The methylation level at the CpG islands is often associated with transcription repression [116,117]. CpG shores are defined as the 2 kb of sequence flanking a CpG island, and CpG shelfs are defined as the 2 kb of sequence further flanking CpG shores. Both CpG shores and shelfs have been reported to be more dynamic than the CpG island itself [90,118,119]. The methylation variation at shores and shelfs have been associated with various diseases. Finally, the remaining regions outside of CpG island/shore/shelf are denoted as open seas [120]. We downloaded the CpG island annotations for Panu2.0 directly from the UCSC Genome Browser, annotated the 2-kb region upstream and downstream of the CpG island boundaries as the shore, annotated the 2-kb regions upstream and downstream of the CpG shores as the CpG shelves, and annotated the remaining regions as open sea (Fig. 3e).

Wolf RRBS data
We also applied our method to analyze a second bisulfite sequencing data set, from 63 gray wolves from Yellowstone National Park in the USA [46,94]. The wolf data are RRBS data collected on the Illumina HiSeq 2500 platform using 100-bp single-end sequencing reads. We obtained bam files for 35 individuals from NCBI (accession number PRJNA299792) [46] and the fastq files for the remaining individuals from accession number PRJNA488382 [94]. We processed all files using the same procedure described in the previous section, using Trim Galore and BSseeker2, with the dog genome canFam 3.1 [121] as the reference genome. Perindividual sequencing read depth ranges from 9.53 to 75.18 million reads per individual (median = 31.36 million reads; sd = 12.91 million). We used the same SNP calling procedure described for baboons and applied the same filtering criteria to obtain a final call set of 518,774 SNPs, with an average of 360,063 SNPs genotyped per individual (median = 440,898; sd = 103,522). We also computed the genetic relatedness matrix K with these SNPs using GEMMA.
To validate variants identified in the wolf data set, we compared the called variants from the bisulfite sequencing data to an existing SNV data base from the current Ensembl release for the dog genome canFam 3.1. We found that 17.9% of variants overlapped with known variants from Ensembl. Importantly, among overlapping variants, 99.1% of them have the same alternative allele as reported in Ensembl. In addition, the proportion of overlapping variants increases with increasing alternate allele frequency and reaches 41.3% when we focus on variants that have an alternate allele frequency > 0.5 in the RRBS data (Additional file 2: Figure S11A).
We followed the same procedure described for baboons to extract methylation measurements on SNP-CpG pairs. In the wolves, the distance between SNP-CpG site in each pair ranges from 1 to 103 bp, with a median of 35 bp (mean = 38.41 bp; sd = 25.63 bp; Additional file 2: Figure S11B). We obtained a total of 861, 474 SNP-CpG pairs, representing 144,670 unique SNPs and 684,681 unique CpG sites. Following quality control filtering, we obtained a final set of 279,223 SNP-CpG pairs, representing 77,039 unique SNPs and 242,784 unique CpG sites, with an average of 179,412 SNP-CpG pairs measured per individual. In this set, the median number of reads per SNP across all individuals is 25 (mean = 31.16; sd = 29.33) and the median number of reads per allele is 14 in heterozygotes (mean = 17.45; sd = 18.90). Methylation levels across sites display the expected bimodal distribution pattern (Additional file 2: Figure S11C), and we observed no bias in methylationlevel estimates between the reference and the alternate alleles (Additional file 2: Figure S11D).
We applied the same analysis procedure to analyze the wolf data as we did for the baboon data set. IMAGE-A was used to analyze 236,092 SNP-CpG pairs where the data set included at least 5 heterozygotes while the other methods were applied to all 279,223 SNP-CpG pairs. We used permutation to construct empirical null distributions for FDR control and controlled for the effects of sex and the top two methylation principal components in the same procedure described in the baboon data. Finally, we annotated CpG sites into island, shore, shelf, and open sea categories as described above, based on the canFam3.1 genome.