Author Correction: The SEQC2 epigenomics quality control (EpiQC) study

Background: Cytosine modifications in DNA such as 5-methylcytosine (5mC) underlie a broad range of developmental processes, maintain cellular lineage specification, and can define or stratify types of cancer and other diseases. However, the wide variety of approaches available to interrogate these modifications has created a need for harmonized materials, methods, and rigorous benchmarking to improve genome-wide methylome sequencing applications in clinical and basic research. Here, we present a multi-platform assessment and cross-validated resource for epigenetics research from the FDA ’ s Epigenomics Quality Control Group. Results: Each sample is processed in multiple replicates by three whole-genome bisulfite sequencing (WGBS) protocols (TruSeq DNA methylation, Accel-NGS MethylSeq, and SPLAT), oxidative bisulfite sequencing (TrueMethyl), enzymatic deamination method (EMSeq), targeted methylation sequencing (Illumina Methyl Capture EPIC), single-molecule long-read nanopore sequencing from Oxford Nanopore Technologies, and 850k Illumina methylation arrays. After rigorous quality assessment and comparison to Illumina EPIC methylation microarrays and testing on a range of algorithms (Bismark, BitmapperBS, bwa-meth, and BitMapperBS), we find overall high concordance between assays, but also differences in efficiency of read mapping, CpG capture, coverage, and platform performance, and variable performance across 26 microarray normalization algorithms. Conclusions: The data provided herein can guide the use of these DNA reference materials in epigenomics research, as well as provide best practices for experimental design in future studies. By leveraging seven human cell lines that are designated as publicly available reference materials, these data can be used as a baseline to advance epigenomics research. NEBNext Enzymatic Methyl-Seq (E7120, NEB) kit following the manufacturer ’ s instructions. using 4 100 were a 2200 HSD1000.

any aligner-specific biases. All four programs returned a nearly identical distribution of CpGs called throughout the genome. The highest genomic enrichment was detected at 5'UTRs, promoter regions, and exonic regions by all programs. Therefore, even though mapping efficiency and CpG depth was influenced by software, the genomic distribution of CpGs was reliably called by all softwares examined. As a result of these comparisons, outputs from bwa-meth were used for all downstream analyses.

5-hydroxymethylcytosine Detection
Total 5-methylcytosine (5mC) and 5-hydroxymethylcytosine (5hmC) levels within each cell line examined in this study were measured by LC-MS/MS (Supplementary Table 6). The estimated percentage of 5hmC levels across all seven cell lines were below the limit of detection for this method.
In order to validate these results at base-level resolution, we used the NuGEN TrueMethyl oxBS-Seq library preparation kit (aka TrueMethyl), which allows investigators to measure 5mC and 5hmC in an indirect manner on the sequence level. For completeness, each cell line replicate was processed using both bisulfite only (BS = 5mC + 5hmC) and an oxidative reaction prior to sodium bisulfite treatment (OX = 5mC only).
Additional file 1: Figure S12 shows that all cell lines have a higher level of 5mC compared to 5hmC (Additional file 1: Figure S12a,b). The low 5hmC levels were also observed at the single-nucleotide resolution level, with similar correlations between the two library preparations across all cell lines (Additional file 1: Figure   S12c), and also within each cell lines (Additional file 1: Figure S12d), where the PCA plot shows little to no separation between libraries prepared using BS or OX protocols.
As stated above, preparation of BS and OX libraries in parallel allows the determination of 5mC, 5hmC and C. We used the MLML2R package to estimate the level of each cytosine state, for each CpG sequenced, using HG002 as example (Additional file 1: Figure S12e). The top panel shows that some CpG sites not only show 100% of a specific cytosine mark (C = 100% unmethylated CpG, mC = 100% methylated CpG), but also a mixture of two (mC_C = methylated or unmethylated C; hmC_C = hydroxymethylated or unmethylated C; mC_hmC = methylated or hydroxymethylated C) or of all cytosine mark (mC_hmC_C). Consistent with the LC-MS/MS quantitation, hmC marks were found in low proportions at some CpG sites. The results observed for HG002 were representative of all the 7 cell lines.

Biological Significance of Between-Family Trio Differential Methylation
To determine the biological relevance of our results, we considered 51 CpGs on Chromosome 1 that had been previously identified as differentially methylated in an array analysis of approximately 300 individuals from Caucasian-American, African-American, and Han Chinese-American populations [6]. Annotation and methylation results from all 51 CpGs are available within Supplementary Table 5. Of the 7 sites with reported |PMD|>0.2 (Percent Methylation Difference) between Chinese-Americans and Caucasian-Americans, all had corresponding |PMD|>0.2 within the microarray data. Additionally, 4 of these were identified as statistically significant DMAs across all six sequencing assays (five short read library types and Oxford Nanopore). Of the three remaining sites, the first (on the TAS1R3 promoter) was significantly hypomethylated in the Chinese family for EMSeq, Nanopore, SPLAT, and TrueMethyl, the second (on the PM20D1 promoter) had insufficient read coverage for TruSeq but was a DMA for the remaining assays, and the third (located on the C1orf100 promoter) was identified as a DMA for only SPLAT although estimated PMD values were greater than 0.1 for all assays. Notably, these sites were identified as methylation quantitative trait loci (meQTL) in the original analysis. In addition to TAS1R3, which is a sweetness taste receptor that is known to vary phenotypically between the Asian and Caucasian populations [7], there was strong concordance for 6 CpGs on the PM20D1 promoter, a gene associated with obesity and Alzheimer's disease with demonstrated population based variation [8,9].
We additionally reviewed the collection of 29,802 sites on Chromosome 1 that were identified as differentially methylated for four or more of the six sequencing assays. Following annotation with HOMER [10], analysis with DAVID [11] identified a subset of 133 genes associated with hypertension (Benjamini Hochberg adjusted p-value = 5.0E-13), 54 genes associated with osteoporosis (p = 5.0E-13), and 18 genes associated with atopic dermatitis (p =1.0E-5) according to the GAD database [12]. Only 1204 (4.0%) of these sites were included on the Infinium MethylEPIC array, and while annotation for these sites included 53 of the hypertension-associated genes (p=3.3E-4) and 9 of those associated with atopic dermatitis (p=0.03), only 17 of the genes identified with osteoporosis were included and this was an insufficient number to result in a significant association.

EMSeq Input Titration
In order to investigate the impact of input DNA on detection and characterization of CpG methylation, we generated EM-Seq libraries using 10ng, 50ng, and 100ng aliquots of input DNA for each replicate for each member of the Chinese Han Trio in this study (HG005-7). We then randomly subsampled each run in silico to a random set of 1M, 5M, 10M, 25M, 50M, and 100M paired end 150bp reads per input. At the lowest read input, the less complex 10ng library covered CpGs greater than 50ng and 100ng libraries, though beyond 25M paired end reads the more complex (50/100ng) libraries surpassed the 10ng library in mean CpG coverage (Additional file 1: Figure S13a). All three library types exhibited similar distributions of CpG coverage across read titrations, reflecting fringe technical noise contributing to mean depth differences at low inputs that were evened out with more input. This was further validated by looking at the intersection of CpGs covered by each input type at each read filtration titer, where by 10M paired end reads the majority of sites were shared by all libraries, and notably the lowest input consistently covered the fewest unique CpGs (Additional file 1: Figure S13c).

Methyl EPIC Capture Correlations
We compared the whole epigenome libraries to sequencing replicates of Illumina Methyl Capture EPIC, a reduced representation bisulfite approach interrogating roughly 3.3 million CpGs with a preference for CpG islands and promoter regions. Results shown for HG002 are representative of all seven genomes. Methylation percentage of CpGs within replicates of Capture EPIC were compared to shared sites among whole methylome assays as well as Nanopore sequencing, with good Pearson correlation for all comparisons (average r=0.85).
Capture EPIC tended to overestimate fully methylated sites that were estimated to be closer to 50-90% in other assays (Additional file 1: Figure S14s).
Using 20X downsampled methylation data, the shared CpG coverage on Chromosome 1 in Capture EPIC sites was highly consistent with overall methylome coverage (Figure 2). Nanopore missed the fewest sites covered by EPIC (n=5,179), while TruSeq missed the most (n=21,712).       Figure S4: Read retention rate. The fraction of total reads that are retained after each step of the epigenome alignment process is shown per assay. Properly mapped = both mates of a pair were mapped in the correct orientation within a 1kb distance. Dedup = removing reads that are marked as duplicates. MQ = Mapping Quality.