RADAR overcomes challenges in modeling MeRIP-seq data and accommodates complex study designs
Using BAM files as the input, RADAR first divides transcripts (concatenated exons) into consecutive 50-bp bins and quantifies pre-IP and post-IP read counts for each bin (Fig. 1a). Unlike current differential methylation analysis methods [8,9,10,11], which normalize by scaling to library size, an approach that can be strongly skewed by highly expressed genes [16] (Additional file 1: Figure S1), RADAR uses the median-of-ratios method [17] implemented in DESeq2 to normalize the INPUT library for robustness. For the IP library, RADAR normalizes the fold enrichment, computed as the IP counts divided by the INPUT counts, which accounts for variation in both IP efficiency and IP library size.
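The median-of-ratios normalization can be sketched in a few lines. The snippet below is a minimal illustration (not RADAR's R implementation) of how per-sample size factors for the INPUT library would be computed from a bin-by-sample count matrix:

```python
import numpy as np

def median_of_ratios_size_factors(counts):
    """Median-of-ratios size factors (Anders & Huber 2010), as used by DESeq2.

    counts: (bins x samples) array of INPUT read counts."""
    with np.errstate(divide="ignore"):
        log_counts = np.log(counts.astype(float))
    # per-bin geometric mean across samples; bins containing a zero drop out
    log_geo_mean = log_counts.mean(axis=1)
    keep = np.isfinite(log_geo_mean)
    # each sample's size factor is its median ratio to the geometric means
    log_ratios = log_counts[keep] - log_geo_mean[keep, None]
    return np.exp(np.median(log_ratios, axis=0))

# toy example: the second sample is sequenced exactly twice as deeply
counts = np.array([[10, 20], [30, 60], [5, 10], [100, 200]])
size_factors = median_of_ratios_size_factors(counts)
```

Because the median ratio is taken over all bins, a handful of very highly expressed genes cannot dominate the size factor the way they dominate total library size.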
After proper normalization across all samples, RADAR calculates the methylation level of each bin conditioned on its pre-IP RNA expression level in each sample. In contrast to previous methods [8,9,10,11] that use peak-level read counts in the INPUT library as the measurement of pre-IP RNA expression, we use gene-level read counts, defined as the total number of reads across all bins spanning the same gene, as a more robust representation (Fig. 1a). This choice is motivated by the observation that the median read coverage within each peak is very low: 18 reads per peak (7 reads in a 50-bp bin) (Additional file 1: Figure S2) in a typical MeRIP-seq input sample of 20 million (mappable) reads (Additional file 1: Figure S3). Over-dispersion of low counts due to random sampling in the sequencing process can introduce substantial unwanted variation into the estimation of pre-IP RNA levels, which can be further exacerbated by the uneven distribution of reads caused by local sequence characteristics such as GC content and mappability. Using gene-level counts as the estimate of pre-IP RNA expression mitigates this dispersion by increasing the number of reads (272 on average) while simultaneously diminishing the effects of local sequence characteristics within a gene (Fig. 1a). Comparing the variance of read counts across replicates at the gene level with that at the bin level, we show that the cross-sample variance is much lower at the gene level in all three datasets (Fig. 1b).
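Schematically (with a hypothetical bin-to-gene mapping; RADAR itself is an R package), the gene-level pre-IP estimate is simply the bin counts summed within each gene:

```python
def gene_level_counts(bin_counts, bin_gene_ids):
    """Sum per-bin INPUT counts into per-gene totals.

    bin_counts: read count of each 50-bp bin
    bin_gene_ids: the gene each bin belongs to (same length)"""
    totals = {}
    for count, gene in zip(bin_counts, bin_gene_ids):
        totals[gene] = totals.get(gene, 0) + count
    return totals

# toy example with two hypothetical genes
bins = [3, 7, 2, 18, 5]
genes = ["geneA", "geneA", "geneA", "geneB", "geneB"]
totals = gene_level_counts(bins, genes)
# each bin's methylation level is then conditioned on its gene's total
```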
RADAR models the read count distribution with a Poisson random effect model rather than the negative binomial distribution commonly used in RNA-seq analysis [13, 15, 17] and adopted by DRME and QNB for MeRIP-seq analysis [9, 10]. Negative binomial models assume a quadratic relationship between the mean read counts and their variance across all genes. In real m6A-seq datasets, we observe that the mean-variance relationship of post-IP counts across genes differs substantially from that of regular RNA-seq counts (i.e., pre-IP counts): the former does not always follow a similar quadratic curvature and can exhibit very different patterns of variability (Fig. 1c, Additional file 1: Figure S4). To overcome these limitations, RADAR applies a more flexible generalized linear model framework (see the “Materials and methods” section) that captures variability through random effects.
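To make the extra-Poisson variability concrete, the toy simulation below (our illustration, not RADAR code, with made-up values for the mean and random-effect scale) draws counts from a Poisson model whose rate carries a log-normal random effect; the marginal variance then sits far above the pure Poisson level (variance = mean), with the excess controlled by the random-effect scale rather than a single genome-wide mean-variance curve:

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = 50.0, 0.5          # hypothetical mean count and random-effect scale

# Poisson random effect model: rate = mu * exp(eps), eps ~ N(-sigma^2/2, sigma^2),
# centred so that the marginal mean stays at mu
eps = rng.normal(-sigma**2 / 2.0, sigma, size=100_000)
counts = rng.poisson(mu * np.exp(eps))

emp_mean, emp_var = counts.mean(), counts.var()
# marginal variance is mu + mu^2 * (exp(sigma^2) - 1),
# far above the Poisson value (variance = mean)
theo_var = mu + mu**2 * (np.exp(sigma**2) - 1.0)
```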
Another important advance of RADAR over existing MeRIP-seq data analysis tools [8,9,10,11] is the flexibility to incorporate covariates and thereby permit more complex study designs. Phenotypic covariates such as age and gender, as well as experimental covariates such as batch information, are often encountered in epitranscriptomic profiling studies involving heterogeneous patient samples; covariates such as litter and age are common in experimental animal studies. For example, in the ovarian cancer dataset, the age of the tissue donors is partially confounded with the predictor variable, disease status. In the T2D islets dataset, the variance of the first two principal components is confounded with the sequencing batch (Fig. 1d). After regressing out the batch effect, the remaining variance is better explained by disease status (Fig. 1e). These observations underscore the importance of controlling for potential confounding factors when performing differential methylation tests. The generalized linear model framework in RADAR allows the inclusion of covariates and thus supports complex study designs.
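As a sketch of how a covariate such as batch enters such a model (a minimal Python stand-in for illustration; RADAR is distributed as an R package), a log-linear Poisson regression can be fit by iteratively reweighted least squares with batch included alongside disease status in the design matrix:

```python
import numpy as np

def poisson_glm(X, y, n_iter=25):
    """Log-linear Poisson regression fit by iteratively reweighted
    least squares (IRLS)."""
    mu = y + 0.5                       # standard starting values
    eta = np.log(mu)
    for _ in range(n_iter):
        z = eta + (y - mu) / mu        # working response
        w = mu                         # IRLS weights for the log link
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        eta = X @ beta
        mu = np.exp(eta)
    return beta

# hypothetical single-bin example: disease groups balanced across batches
rng = np.random.default_rng(1)
n = 40
disease = np.repeat([0, 1], n // 2)
batch = np.tile([0, 1], n // 2)
X = np.column_stack([np.ones(n), disease, batch])   # intercept + covariates
y = rng.poisson(np.exp(3.0 + 0.8 * disease + 0.5 * batch))
beta = poisson_glm(X, y)
# beta[1] estimates the disease effect; the batch effect is absorbed by beta[2]
```

Dropping the batch column from `X` in an unbalanced design would push the batch effect into the disease coefficient, which is the confounding the covariate term guards against.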
Comparative benchmarks of different methods using simulated datasets
To evaluate the performance of RADAR against current methods, we applied RADAR and other MeRIP-seq differential analysis methods, including exomePeak, Fisher’s exact test, MeTDiff, and QNB, to simulated datasets. We considered four scenarios: the proposed random effect model with and without covariates, and the quad-negative binomial model adopted from QNB [9, 10] with and without covariates. For each scenario, we evaluated the sensitivity and false discovery rate (FDR) of the different methods using ten simulated copies. We first simulated a dataset of eight samples using the random effect model (Eq. (1) in the “Materials and methods” section; denoted the simple case). The INPUT library was drawn directly from the T2D dataset. We simulated IP read counts, adjusted for the pre-IP expression level of each bin, according to Eq. (1), with μ equal to the mean log read count in the “control” group of the T2D dataset. The final IP read counts were obtained by rescaling the simulated data by the average IP/INPUT ratio observed in the T2D data. In total, we simulated three datasets of 26,324 sites in which 20% of the sites are true positives, with effect sizes of 0.5, 0.75, and 1, respectively.
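A stripped-down version of this simulation scheme (with made-up values for the number of sites, μ, the noise scale, and the effect size, and omitting the rescaling against the real T2D INPUT data) might look like:

```python
import numpy as np

rng = np.random.default_rng(3)
n_sites, n_samples = 1000, 8
group = np.repeat([0, 1], n_samples // 2)    # 4 control vs 4 case samples

mu, effect, sigma = 2.0, 0.75, 0.3           # hypothetical illustrative values
is_dm = np.arange(n_sites) < n_sites // 5    # 20% true differential sites

# random effect model: per-site, per-sample log rate with Gaussian noise,
# plus the DM effect added in the case group at true sites
log_rate = (mu + effect * np.outer(is_dm, group)
            + rng.normal(0.0, sigma, size=(n_sites, n_samples)))
ip_counts = rng.poisson(np.exp(log_rate))
```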
For DM loci with an effect size of 0.5, RADAR achieved 29.1% sensitivity and 12.0% FDR at an FDR cutoff of 10%. At the same cutoff, exomePeak and Fisher’s test achieved 72.8% sensitivity/52.5% FDR and 72.2% sensitivity/50.5% FDR, respectively, and MeTDiff achieved 10.5% sensitivity and 16.2% FDR, whereas QNB had essentially no power at this small effect size. As the effect size increased, RADAR achieved much higher sensitivity, 77.8% for an effect size of 0.75 and 95.7% for an effect size of 1, with FDRs well calibrated at 10.4% and 10.1%, respectively. exomePeak and Fisher’s test both achieved 89% and 96% sensitivity for effect sizes of 0.75 and 1, respectively, but at the cost of unsatisfactory FDRs greater than 46%. MeTDiff exhibited well-calibrated FDRs (12.3% and 11.4%) and moderate sensitivity (50.4% and 81.5%) for effect sizes of 0.75 and 1, respectively. QNB showed power only at an effect size of 1, and even then it was low (beta = 1; 13.9% sensitivity, 0.5% FDR). Overall, in the simple case without covariates, RADAR achieved high sensitivity while maintaining a low FDR across the range of true effect sizes (Fig. 2a). Repeating the analysis at varying FDR cutoffs, we found that RADAR achieved the highest sensitivity at any fixed level of empirical FDR (Additional file 1: Figure S5A). We note that exomePeak and Fisher’s test achieved high sensitivity at all effect sizes because combining read counts across replicates of the same group helps gain power; as a tradeoff, failing to account for within-group variability resulted in high FDRs. In contrast, RADAR and MeTDiff exhibited well-calibrated FDRs while achieving sensitivity comparable to exomePeak at large effect sizes. QNB was over-conservative and had little power.
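The sensitivity/FDR bookkeeping used throughout these comparisons can be reproduced with two small helpers (our sketch; any standard Benjamini-Hochberg implementation would serve):

```python
import numpy as np

def bh_reject(pvals, alpha=0.10):
    """Benjamini-Hochberg procedure: boolean mask of rejected tests."""
    m = len(pvals)
    order = np.argsort(pvals)
    passed = pvals[order] <= alpha * np.arange(1, m + 1) / m
    k = passed.nonzero()[0].max() + 1 if passed.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True   # reject the k smallest p values
    return reject

def sensitivity_fdr(reject, is_true_dm):
    """Empirical sensitivity and false discovery proportion."""
    tp = np.sum(reject & is_true_dm)
    fp = np.sum(reject & ~is_true_dm)
    sens = tp / max(is_true_dm.sum(), 1)
    fdr = fp / max(reject.sum(), 1)
    return sens, fdr

# toy example: two true DM sites with small p values, two nulls
pvals = np.array([0.001, 0.002, 0.2, 0.9])
truth = np.array([True, True, False, False])
sens, fdr = sensitivity_fdr(bh_reject(pvals), truth)
# sens == 1.0 and fdr == 0.0 for this toy input
```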
We next applied the aforementioned methods to the proposed model with a covariate (effect size equal to 2; denoted the difficult case) (Fig. 2b). At an FDR cutoff of 10%, RADAR achieved 38.4%, 79.7%, and 95.7% sensitivity, with empirical FDRs slightly higher than in the simple case (18.2%, 14.4%, and 13.7% for effect sizes of 0.5, 0.75, and 1, respectively). MeTDiff, which performed similarly to RADAR in the simple case, lost power in the difficult case because it cannot account for confounding factors. exomePeak, Fisher’s test, and QNB behaved much as they did in the simple case. The advantage of RADAR over the other methods is robust to the choice of FDR cutoff, as shown in Additional file 1: Figure S5B. In summary, RADAR outperformed existing alternatives in both cases.
Taking the covariate model with a DM effect size of 0.75 as an example, we also examined the distributions of effect size estimates and p values obtained from each method. All methods estimated effect sizes correctly overall, with estimates for “true” sites centered at 0.75 (Additional file 1: Figure S6A) and those for null sites centered at zero (Additional file 1: Figure S6B). However, the distribution of beta estimates is narrower for RADAR, especially in the difficult case, suggesting more confident estimation. p values of exomePeak and Fisher’s test at null sites were enriched near zero, indicating over-detection of false-positive signals (Additional file 1: Figure S6C). We also observed many large p values from QNB for “true” sites in both cases, and from MeTDiff in the difficult case, suggesting high false-negative rates (Additional file 1: Figure S6D).
We then repeated the simulation studies using the QNB model. Instead of setting the variances of the INPUT and IP libraries equal, as presented in the QNB paper, we let the variance of the IP read counts be larger than that of the INPUT; this setting better reflects our observations in real data, where extra noise can be introduced during the immunoprecipitation step of IP read generation (Additional file 1: Figure S4). In the simple case without covariates, RADAR exhibited the lowest empirical FDR (18.9% and 18.5%), despite slightly lower sensitivity compared with the other methods (73.5% and 82.3%), when the effect sizes were relatively large (0.75 and 1). QNB performed better when the effect size was small, with 58.6% sensitivity and 15.6% FDR at an effect size of 0.5 (Fig. 2c). The results were consistent when we evaluated performance at different FDR cutoffs: overall, QNB performed slightly better than RADAR at an effect size of 0.5, whereas RADAR achieved similar sensitivity with better-calibrated FDR at effect sizes of 0.75 and 1 (Additional file 1: Figure S5C). In the model with covariates, RADAR again exhibited the lowest empirical FDR, 25.8%, 23.0%, and 22.5% at effect sizes of 0.5, 0.75, and 1, respectively, while the other methods either failed to detect the signal or had higher empirical FDRs. Specifically, at an FDR cutoff of 10%, MeTDiff had sensitivity below 0.5% across effect sizes, and QNB reached FDRs of 64.1%, 55.8%, and 50.5% for effect sizes of 0.5, 0.75, and 1, respectively (Fig. 2d). The advantage of RADAR over the alternative methods held in the difficult case at varying cutoffs (Additional file 1: Figure S5D). In summary, RADAR outperformed the existing methods in most scenarios, particularly when covariates were present.
Comparative benchmarks of different methods using four real m6A-seq datasets
Next, we compared the performance of the different methods using four real m6A-seq datasets: ovarian cancer (GSE119168), T2D (GSE120024), mouse liver (GSE119490), and mouse brain (GSE113781). To evaluate the sensitivity of the different methods, we first examined the distributions of p values obtained from the corresponding DM tests (Fig. 3). In the ovarian cancer, T2D, and mouse liver data, Fisher’s test and exomePeak detected the most signals, with p values densest near zero. In these three datasets, RADAR also returned a desirable p value histogram, with p values enriched near zero and uniformly distributed elsewhere. MeTDiff returned the desired shape only in the ovarian cancer and mouse liver datasets, and QNB was over-conservative in the ovarian cancer and T2D datasets. All methods failed to return p values enriched near zero for the mouse brain dataset, suggesting little or no signal in this dataset, consistent with the original publication, in which very few differential peaks were detected [7].
To verify that the well-performing methods achieved high sensitivity while maintaining a low FDR, we further performed permutation analyses to obtain the null distribution of p values for each dataset. Specifically, we shuffled the phenotype labels of the samples such that the new labels were not associated with the true ones or with any other important confounding factors. We expected the p values from a permutation test to follow a uniform distribution, so any enrichment of p values near zero would represent false discoveries. For each dataset, we combined test statistics from 15 permuted copies and compared their distribution with that of the original tests (Fig. 4). p values from Fisher’s test and exomePeak were strongly enriched near zero, only slightly less so than in the original tests, suggesting that the strong signals detected by these two methods are likely false discoveries, consistent with the conclusion from the simulation analysis. In contrast, the histograms of p values from RADAR were close to flat in all datasets, indicating that the strong signals detected by RADAR were more likely to be true. MeTDiff exhibited well-calibrated p values in the ovarian cancer and T2D data but was enriched for small p values in the mouse liver data, indicating a high FDR. The QNB test returned conservative p value estimates in all datasets. Taking these analyses together, we demonstrated that RADAR outperforms the alternatives by achieving high sensitivity and specificity simultaneously in real datasets.
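The permutation scheme can be sketched as follows. This is a self-contained toy: a per-bin z-test on synthetic data with no true signal stands in for the methods' actual DM statistics, and the point is that label shuffling yields an approximately uniform null p value distribution when the test is well calibrated:

```python
import math
import numpy as np

def ztest_rows(mat, labels):
    """Two-sided normal-approximation p value per row (toy DM test)."""
    a, b = mat[:, labels == 0], mat[:, labels == 1]
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1]
                 + b.var(axis=1, ddof=1) / b.shape[1])
    z = (a.mean(axis=1) - b.mean(axis=1)) / se
    return np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])

def permutation_null_pvalues(mat, labels, test_fn, n_perm=15, seed=0):
    """Re-run the test on shuffled phenotype labels to build an empirical
    null; enrichment near zero would flag an anti-conservative test."""
    rng = np.random.default_rng(seed)
    return np.concatenate([test_fn(mat, rng.permutation(labels))
                           for _ in range(n_perm)])

rng = np.random.default_rng(7)
mat = rng.normal(size=(500, 16))        # synthetic bins with no true signal
labels = np.repeat([0, 1], 8)
null_p = permutation_null_pvalues(mat, labels, ztest_rows)
frac_small = (null_p < 0.05).mean()     # should sit near the nominal 0.05
```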
To further illustrate that RADAR detects DM sites with better sensitivity and specificity in real data, we show an example of a DM site detected only by RADAR, as well as likely false-discovery sites identified by exomePeak and Fisher’s test but not by RADAR, in the T2D dataset. Plotting the sequence coverage of individual samples at the DM site in the RNF213 gene (Additional file 1: Figure S7A), we show that, despite large variability among control samples, m6A enrichment in T2D samples is consistently lower at this locus. Conversely, at the spurious DM sites detected by the alternative methods (Additional file 1: Figure S7B, C), enrichment differences are driven mainly by one or two outlier samples in one group.
To further demonstrate the advantage of using gene-level rather than local read counts to account for RNA expression level, we repeated the above analysis using post-IP counts adjusted by the local read counts of the INPUT. In the T2D dataset, gene-level adjustment not only enabled stronger signal detection but also lowered the FDR: permutation analysis with local-count adjustment produced an undesirable enrichment of p values near zero (Additional file 1: Figure S8). In the ovarian cancer and mouse liver datasets, local-count adjustment achieved higher signal detection but at the cost of a higher FDR. This analysis suggests that using gene-level read counts as the estimate of pre-IP RNA expression can effectively reduce the FDR and lead to more accurate DM locus detection.
Owing to its robust representation of pre-IP RNA expression by gene-level read counts, RADAR’s performance is also more robust to the sequencing depth of the INPUT samples. To demonstrate this, we applied RADAR to data created by sub-sampling the read counts of the INPUT samples in the T2D dataset so that the sequencing depth was half that of the full dataset (17.5 million reads on average), and compared the DM sites detected in the reduced dataset with those obtained from the full dataset (Additional file 1: Figure S9A). At a 10% FDR cutoff, the DM sites RADAR detected in the reduced dataset showed the highest overlap with those in the full dataset. MeTDiff and QNB had only a few overlapping DM sites between the sub-sampled and full datasets, while Fisher’s test and exomePeak had slightly fewer overlaps than RADAR but more false discoveries. We further compared the log fold change (logFC) estimates from the reduced and full datasets to check their consistency and found that reduced sequencing depth had the least impact on the logFC estimated by RADAR, whereas the estimates of the other methods were much less reproducible at the shallower sequencing depth (Additional file 1: Figure S9A).
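The sub-sampling itself amounts to binomial thinning of the count matrix (a standard trick; shown here on a toy vector rather than the actual T2D INPUT counts):

```python
import numpy as np

def subsample_counts(counts, fraction=0.5, seed=0):
    """Simulate shallower sequencing: keep each read independently with
    probability `fraction` (binomial thinning of the counts)."""
    rng = np.random.default_rng(seed)
    return rng.binomial(counts, fraction)

full = np.array([100, 40, 8, 0])     # toy INPUT counts for four bins
half = subsample_counts(full, fraction=0.5)
# thinned counts never exceed the originals and average half their size
```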
Unlike earlier pipelines that perform DM tests only on peaks identified by peak calling, RADAR tests all filtered bins directly and reports DM sites. To check whether the DM sites reported by RADAR are consistent with known characteristics of m6A, we performed a de novo motif search on these sites and found that the DM sites detected in the ovarian cancer, mouse liver, and T2D datasets are enriched for the known m6A consensus motif (Additional file 1: Figure S10A) [18], suggesting that the DM sites reported by RADAR are mostly true. We also examined the topological distribution of these DM sites by metagene analysis (Additional file 1: Figure S10B). The distributions in the ovarian cancer and mouse liver datasets are consistent with the topological distribution of common m6A sites, indicating that the methylation changes in these two datasets were not spatially biased. Interestingly, DM sites detected in the T2D dataset are strongly enriched in the 5′UTR, suggesting that T2D-related m6A alterations are more likely to occur in the 5′UTR.
RADAR analyses of m6A-seq data connect phenotype with m6A-modulated molecular mechanisms
Finally, we investigated whether the DM test results obtained from RADAR lead to better downstream interpretation. In the ovarian cancer dataset, we performed KEGG pathway enrichment analysis on the differentially methylated genes (DMGs) detected by RADAR (Fig. 5a). The detected DMGs were enriched for molecular markers related to ovarian cancer dissemination [19, 20]. For instance, we identified key regulators of the PI3K (enrichment p value 7.8 × 10−5) and MAPK (enrichment p value 1.1 × 10−4) pathways, including hypo-methylated PTEN and hyper-methylated BCL2 (Additional file 1: Figure S11). Other notable DMGs include key markers of ovarian cancer such as MUC16 (CA-125) and PAX8, as well as genes that play key roles in ovarian cancer biology such as CCNE1 and MTHFR. Conversely, DMGs detected by MeTDiff were enriched in only three KEGG pathways (Fig. 5b), most likely owing to its inadequate power. Our permutation analysis showed that exomePeak and Fisher’s test results include a substantial fraction of false positives, which could bias downstream interpretation.
In the T2D dataset, DMGs identified by RADAR were enriched in related pathways, including the insulin signaling, type II diabetes mellitus, mTOR, and AKT pathways (Additional file 1: Table S1), pointing to a role that m6A might play in T2D. Further analysis of these DMGs showed that the insulin/IGF1-AKT-PDX1 signaling pathway was mostly hypo-methylated in T2D islets (Additional file 1: Figure S12). Impairment of this pathway, resulting in downregulation of PDX1, is a recognized mechanism associated with T2D; PDX1 is a critical gene that regulates β cell identity and the cell cycle and promotes insulin secretion [21,22,23,24]. Indeed, a follow-up experiment in a cell line model validated the role of m6A in tuning the cell cycle and insulin secretion in β cells, and an animal model lacking the methyltransferase Mettl14 in β cells recapitulated key T2D phenotypes (results presented in a separate manuscript [25]). In summary, the RADAR-identified DMGs enabled an in-depth analysis of the role that m6A methylation plays in T2D. In contrast, because they cannot take sample acquisition batches as covariates, the alternative methods were underpowered to detect DM sites in the T2D dataset and could not support any in-depth discovery of m6A biology in T2D islets. These examples suggest that MeRIP-seq followed by RADAR analysis can further advance functional studies of RNA modifications.
Validation of RADAR-detected DM sites by the SELECT method
Recently, Xiao et al. developed an elongation- and ligation-based qPCR amplification method (termed SELECT) for single nucleotide-specific detection of m6A [26]. Because SELECT relies on a mechanism different from the antibody pull-down underlying MeRIP-seq, it is a suitable orthogonal method for validating DM sites discovered by RADAR analysis. We selected six DM sites (Additional file 1: Table S2), including two sites detected only by RADAR and four sites in genes important for β cells, for experimental validation by SELECT. Among the six sites, the β cell regulator PDX1 and a RADAR-specific DM site showed significant m6A level alterations, with p values of 0.009 and 0.017, respectively (Fig. 6). Three other sites, IGF1R in the insulin/IGF1-AKT-PDX1 signaling pathway, MAFA (another important regulator of β cell function), and the RADAR-specific DM site in CPEB2, showed m6A changes consistent with the RADAR results despite not reaching statistical significance. The site in the TRIB3 gene was similarly methylated in control and T2D samples as measured by SELECT. Overall, five of the six experimentally tested sites were supported by orthogonal evidence from SELECT, confirming the reliability of RADAR-detected differential methylation sites.