Benchmark strategy
Currently, twelve analytic tools have been developed to detect DNA methylation using ONT direct sequencing (Table 1). Among them, ten tools are compatible with R9.4 series flow cells, and nine of these ten can predict 5-methylcytosine (5mC). We compared the performance of those seven state-of-the-art methylation-calling tools targeting 5mCs in different CpG contexts; those seven tools are all compatible with the most favored ONT flow cell version (R9.4 and R9.4.1 pores): Nanopolish [9], Megalodon [36], DeepSignal [35], Guppy [32, 51], Tombo/Nanoraw (referred to as Tombo) [20], DeepMod [34], and METEORE [38] (Fig. 1B). Tombo is statistics-based while the other six tools are model-based. METEORE combines predictions from two or more tools that showed improved accuracy over individual tools using random forest (RF) models or multiple linear regression models. We chose the METEORE RF model combining Megalodon and DeepSignal as it achieved lower root mean square error (RMSE) than other available METEORE models [38]. We excluded SignalAlign [39], as its repository has not been updated for over 4 years. We also excluded DeepMP [37], as its repository is still under development. We developed the following three-step standardized workflow for benchmarking (Fig. 1B, C):
Step 1. Basecalling and quality control
To translate raw signal data into nucleotide sequences, we conducted the basecalling step using Guppy (v4.2.2). Then we used NanoPack [52] for data visualization and processing, to assess the read-length and basecalling quality, and to demultiplex sequencing data for downstream analysis. Together, the four ONT datasets exhibited median read lengths ranging from 3756 to 6524 bp, and median base quality ranging from 9.8 to 13 (Fig. 2A, B). The proportion of long reads (> 10,000 bp) is higher in the NA19240 dataset (36.75%) than in the other three datasets (median proportion = 32.29%), due to library preparation differences (See “Methods” for more details). We assessed CpG sites located in singletons and non-singletons (Additional file 1: Fig. S1A), and biologically relevant genomic contexts including gene bodies and CpG islands (Additional file 1: Fig. S1B), different CpG densities, and repetitive regions. The distribution of CpG sites in different regions is shown in Fig. 2C–F, Additional file 1: Fig. S2, and Additional file 2: Table S1.
Step 2. Genome assembly and polishing
We aligned the basecalled reads to human genome assembly GRCh38/hg38 using minimap2 [53]. Basecalling a squiggle, i.e., translating the electric current signal from a nanopore read into a DNA sequence, typically contains some errors when comparing the resulting sequence to a reference sequence [54]. The Tombo re-squiggle algorithm refines the assignment from a squiggle to a reference sequence after basecalling and alignment. The refined basecalled reads and alignment by this re-squiggle algorithm is required by Tombo and DeepSignal for DNA methylation calling.
Step 3. Methylation calling and evaluation
We detected 5mCs in different CpG contexts using each of the seven methylation-calling tools based on the corresponding recommended parameters. Specifically, Guppy recommended using the ONT fast5Mod program [55] to extract the methylation calling information at the site level from the basecalling output (Fig. 1B). We then designed three performance-evaluation criteria (Fig. 1C) to benchmark the performances of each methylation-calling tool and used bisulfite sequencing (BS-seq, coverage ≥ 5) data to determine the ground truth. First, we evaluated the per-read performance of the 5mC prediction, i.e., at single-molecule, single-base resolution, based on fully methylated or fully unmethylated CpG sites across various genomic contexts. The performance metrics included F1 score, accuracy, receiver operating characteristic curves (ROC curves), and area under the ROC curve (AUC). Second, we assessed the per-site performance of the 5mC prediction. Specifically, we measured the 5mC percentage correlation coefficient between nanopore sequencing and BS-seq across all CpG sites at the human whole-genome level. Furthermore, we evaluated the relationship between CpG methylation percentage and distance to the annotated transcription start site (TSS) or CCCTC-binding factor (CTCF) binding sites. Third, we assessed the running speed and resource usage evaluation. Further details on performance criteria used in the evaluation are shown in “Methods.”
Benchmark datasets
We used four datasets for benchmarking: nanopore sequencing of the human B-lymphocyte cell lines NA19240 (referred to as NA19240, R9.4.1) [56] and NA12878 (referred to as NA12878, R9.4) [57], the human leukemia cell lines K562 (referred to as K562, R9.4.1), and a human primary acute promyelocytic leukemia clinical specimen (referred to as APL, R9.4.1).
For nanopore sequencing, we used published high-coverage nanopore sequencing datasets for the cell line NA19240 (~ 32× sequencing coverage) from the 1000 Genomes Project [56], and the cell line NA12878 (~ 26× sequencing coverage) from Whole Human Genome Sequencing Project [57], and generated nanopore sequencing datasets for K562 and APL with ~ 1–3× coverage. For DNA methylation ground truth, we used the published NA12878 and K562 whole-genome bisulfite sequencing (WGBS) datasets, and the NA19240 reduced representation bisulfite sequencing (RRBS) dataset from the Encyclopedia of DNA Elements (ENCODE) [58]. We also generated WGBS and oxidative bisulfite sequencing (oxBS-seq) data for APL. More details can be found in “Methods.”
Per-read performance of 5mC prediction
Nanopore sequencing can detect cytosine-methylation state for individual molecules. We assessed the per-read performance of the seven DNA-methylation-calling tools at single-molecule, single-base resolution in singletons and non-singletons. We compared methylation calling performances on fully methylated or fully unmethylated CpGs using BS-seq as ground truth across the four datasets (Additional file 2: Table S2). We divided non-singletons into two sub-categories: (1) concordant non-singletons: non-singletons contain CpGs that are either fully methylated or fully unmethylated, and (2) discordant non-singletons: non-singletons that contains both fully methylated and fully unmethylated CpGs. Nanopolish, Megalodon, DeepSignal, and Guppy outperformed the other three tools on all datasets measured by F1-score, accuracy, and AUC (Fig. 3, Additional file 1: Fig. S3, and Additional file 2: Table S3). Notably, all tools exhibited lower F1 scores (less than 0.90, Fig. 3A) and accuracy (less than 0.93, Additional file 1: Fig. S3B) at discordant non-singletons than at any of the other CpG contexts, consistently across four datasets (Additional file 1: Fig. S3C). Also, all methods achieved higher performance on concordant non-singletons than singletons. The observation may be relevant to the fact that model-based methylation-calling tools (e.g., Nanopolish, DeepSignal, DeepMod, and METEORE) used “concordant” training data—completely methylated sequences and completely unmethylated sequences. Moreover, Nanopolish and Tombo borrow the signals of neighboring CpG sites to call DNA methylation.
Different genomic contexts display different CpG densities and DNA methylation levels [59]. Thus, to evaluate the impact of biologically relevant genomic contexts on 5mC predictions, we considered promoters, exons, introns, intergenic regions (referred as intergenic), CpG islands, shores, and shelves (Fig. 4A, Additional file 1: Fig. S4A, Additional file 2: Table S3), regions with different CG densities (Fig. 4B, Additional file 1: Fig. S4B), and different types of repetitive regions (Fig. 4C, Additional file 1: Fig. S4C). All seven tools exhibited a lower F1 score (< 0.93) for intergenic regions than for any other genic regions or CpG islands, shores, and shelves (Fig. 4A). We next assessed if CG density impact the performance of 5mC predictions using nanopore sequencing (Fig. 4B). Specifically, CG density is calculated by the percentage of G and C bases in 5-base windows. Tombo and METEORE suffered from low accuracy predictions in all CG density regions, but particularly so in low CG density regions. CG density significantly associated with the performance of Nanopolish, Megalodon, DeepSignal, Guppy, and Tombo with p value < 0.05 by the analysis of variance (ANOVA) test (Additional file 2: Table S4). Moreover, we examined five categories of repetitive regions: short interspersed nuclear element (SINE), long interspersed nuclear element (LINE), long terminal repeat (LTR), DNA transposons, and others (Fig. 4C). Nanopolish, Megalodon, DeepSignal, Guppy, and Tombo showed lower F1 scores for SINE and LTR regions than for the other repetitive regions. Compared to the other tools, Nanopolish, Megalodon, DeepSignal, and Guppy consistently exhibited higher overall F1 scores on CpG sites across all datasets and across genic and intergenic regions, repetitive regions, and regions with different CG densities (Fig. 4).
DeepMod ranked the lowest in F1 score, accuracy, and AUC, when applied to all four human ONT datasets (Additional file 2: Table S3) across different genomic contexts (Figs. 3 and 4), while it is comparable to the other six tools when using the 5mC positive control dataset from E. coli [38] (Additional file 2: Table S5), suggesting the importance of evaluating the performance of these analytic tools using human ONT datasets, since not all tools are compatible with genomes with higher complexity than that of E. coli, such as the human genome.
To determine whether the performance of the tools at genomic regions in general is impacted by the percentage of non-singletons in the regions, we assessed the percentage of singletons and non-singletons for each genomic context (Additional file 2: Table S6). In addition, we tested the significance of the Pearson correlation coefficient between F1 score achieved by each tool and non-singleton percentage, and the relationship was statistically significant (p value < 0.05, correlation coefficients range from 0.286 to 0.423, Additional file 2: Table S7) for three tools, i.e., Nanopolish, DeepSignal, and Guppy. These observations suggest that the various 5mC prediction performances across different genomic contexts are significantly influenced by the distribution of singletons and non-singletons in three tools. In summary, Nanopolish, Megalodon, DeepSignal, and Guppy outperformed the other tools in per-read performance of 5mC prediction across genomic contexts.
Per-site performance of 5mC prediction
To assess the performance of the seven tools for CpG sites with a full range of methylation levels, we evaluated the Pearson correlation coefficient between DNA methylation percentage from nanopore sequencing (read coverage ≥ 3) and from the corresponding BS-seq data (coverage ≥ 5), both at single-base resolution. To obtain per-site DNA methylation percentages, we either obtained DNA methylation reports directly from each tool or followed the instruction of each tool to aggregate per-read 5mC predictions or to obtain the fraction of methylated reads. We found that the 5mC percentage predicted by Nanopolish, Megalodon, DeepSignal, and Guppy showed the highest correlation (≥ 0.80) with all four datasets (Fig. 5A and Additional file 2: Table S8).
BS-seq of all datasets exhibited a bimodal distribution of DNA methylation (0 for unmethylated, 1 for methylated), and the histogram of the DNA methylation output of Nanopolish, Megalodon, DeepSignal, and Guppy also displayed a similar bimodal distribution on NA19240 (Fig. 5A). Similar predicted methylation patterns and performance of Nanopolish, Megalodon, and DeepSignal for NA12878 were observed by recent research [43]. In contrast, the DNA methylation-level histogram of Tombo and METEORE showed multiple peaks between 0 and 100% methylation levels, rather than two peaks. Furthermore, the Pearson correlation between BS-seq and DeepMod was close to zero for NA19240 data (Fig. 5A), confirming that DeepMod cannot effectively predict methylation distribution at the human whole-genome level. Moreover, Nanopolish, Megalodon, DeepSignal, and Guppy consistently produced the highest correlation coefficients at all genic and intergenic regions, CG density regions, and repetitive regions for NA19240 data (Additional file 1: Fig. S5A-B).
Furthermore, we found that the correlation among the top four performers, i.e., Guppy, Nanopolish, Megalodon, and DeepSignal, is greater than the correlation between BS-seq and nanopore sequencing data (Fig. 5A, Additional file 2: Table S8). We hypothesize that the ability to distinguish 5hmC from 5mC by nanopore sequencing, at least in part, contributes to the discrepancy between BS-seq and nanopore sequencing. To test the hypothesis, we generated oxBS-seq for APL (See “Methods”) and detected the 5hmC percentage for each CpG site by integrating matched APL oxBS-seq and WGBS data by the MLML (maximum likelihood methylation levels) algorithm [60]. We compared the 5hmC percentage in CpGs exhibiting agreement (methylation level difference < 5%) and discrepancy (methylation level difference > 40%) between WGBS and nanopore sequencing for APL. The CpG sites exhibiting discrepancy showed a significantly higher 5hmC level than those exhibiting agreement (p value < 0.0001, Wilcoxon rank sum test, Additional file 1: Fig. S6A-D). These observations suggest that the ability to distinguish 5hmC from 5mC by nanopore sequencing enables a more accurate 5mC detection than BS-seq.
To assess the impact of biological context on the methylation predictions, we explored the relationship between CpG methylation percentage and the distance to annotated TSSs. As expected, CpG sites near TSSs tended to be unmethylated. Methylation levels increased as the distance from the TSS increased. DNA methylation patterns detected from nanopore sequencing via Nanopolish, DeepSignal, and Megalodon closely resembled the pattern for the WGBS data (Fig. 5B, Additional file 1: Fig. S7A-B, and Additional file 2: Table S9). Notably, Guppy displayed the lowest DNA methylation levels at TSSs. Furthermore, methylated cytosines affect DNA-binding specificities of hundreds of human transcription factors [61]. Binding sites of the transcription factor CTCF are characterized by low DNA methylation levels [62]. CTCF plays a critical role in long-range chromatin interactions, the formation and maintenance of topologically associated domains, and transcription. Thus, we assessed the relationship between CpG methylation percentage and the distance to the center of the CTCF binding peaks from the ChIP-seq data of the matching cell lines (NA19240, NA12878, and K562). Indeed, DNA methylation percentage was lowest at the center of the CTCF binding peaks, and the ONT 5mC predictions by Nanopolish, Megalodon, DeepSignal, and Guppy closely tracked the pattern of WGBS data (Fig. 5C, Additional file 1: Fig. S7C, and Additional file 2: Table S9).
Overall, Nanopolish, Megalodon, DeepSignal, and Guppy had high correlations with BS-seq, and they closely tracked the methylation patterns of BS-seq at the whole-genome level. METEORE, which models 5mC by integrating the output of Megalodon and DeepSignal, did not perform well in any evaluation criteria that we assessed. The correlation coefficient of DNA methylation across CpG sites between the seven tools and BS-seq is consistent with the read-level accuracy (Figs. 3 and 4).
Megalodon and DeepSignal predicted more CpG sites than did Nanopolish and Guppy at the site level
Though CpG sites are the same for all the tools after the basecalling and alignment step, the predicted number of CpG sites is different because each methylation-calling tool has their own criteria to make confident methylation predictions. Therefore, we next evaluated the number of CpG sites (read coverage ≥ 3) with 5mC predictions. The UpSet diagram shows the number of overlapped sites among the tools (Fig. 6 and Additional file 1: Fig. S8). Compared to the other five tools, Megalodon and DeepSignal covered more CpG sites on all four datasets. Of the predicted CpG sites in NA19240, 50% of the sites were predicted by all tools (Additional file 2: Table S10). Furthermore, among all the CpG sites predicted collectively by the top four performers (i.e., Nanopolish, Megalodon, DeepSignal, and Guppy), 92% and 85% of the sites were predicted by all four tools for the NA19240 and NA12878 datasets, shown by the Venn diagrams (Fig. 6 and Additional file 2: Table S11). Megalodon and DeepSignal predicted the highest number of CpG sites, i.e., they each predicted 99% of the union of CpG sites using the NA19240 dataset, while Nanopolish and Guppy predicted 93% and 95%, respectively, of the union CpG sites in that dataset, due to the more stringent criterion of log-likelihood ratio cutoffs [9, 32, 51, 63]. Thus, Megalodon and DeepSignal covered 6% and 4% more CpG sites than did Nanopolish and Guppy, and the differences increase greatly for lower sequencing-depth ONT datasets (APL and K562, Additional file 1: Fig. S8A-B). In summary, among the top four performers, Megalodon and DeepSignal predicted the largest number of CpG sites. Also, Tombo and DeepMod predicted the fewest CpG sites.
Running time and memory usage on benchmark datasets
To evaluate the running time and peak memory of each methylation-calling tool, we ran seven pipelines starting from the initial stage of taking input of raw fast5 files to the final output of the read-level and genome-level prediction results using the same high-performance computing (HPC) platform and environment for all seven tools (See “Methods”). We then split the raw ONT data to parallelize methylation calling for each tool. To minimize run time, a GPU and eight processors of hardware resources were allocated to each job for the methylation-calling tools. The SLURM resource and job management system effectively monitor the use of computing resources on HPC clusters [64]. Therefore, for each tested dataset we ran all jobs managed by SLURM and calculated the sums of CPU utilized times (hours), the max of job wall-clock times (hours), and the peak memory use (GB) based on reported logs of SLURM jobs for each pipeline (Fig. 7, Additional file 2: Table S12). We noted that only METEORE depends on the 5mC prediction of other tools’ results (e.g., Megalodon and DeepSignal), and thus the actual running time is the sum of the running times of METEORE, Megalodon, and DeepSignal. Guppy processed the fast5 raw signal file for NA19240 (~ 32× coverage) in the shortest amount of CPU time (151 h), followed by Megalodon and Nanopolish (698 and 703 h), while the CPU times for Tombo, DeepSignal, and DeepMod for the same file were much longer than the time required by Guppy (42×, 150×, and 186× longer, respectively). Furthermore, Guppy and Nanopolish exhibited the lowest peak memory usage (~ 13 and 19 GB), while Megalodon exhibited the highest peak memory usage (17× higher than that of Guppy). The same analysis of run time and peak memory usage for the other three benchmark datasets confirmed the ranking for these tools (Additional file 2: Table S12). In conclusion, Guppy and Nanopolish required both the least amount of CPU time and exhibited the lowest peak memory usage. DeepSignal and Tombo consumed more CPU times, but low peak memory, while Megalodon consumed large peak memory but short CPU time. METEORE and DeepMod both require the highest peak memory and CPU running time.