Canalization of genome-wide transcriptional activity in Arabidopsis thaliana accessions by MET1-dependent CG methylation

Background
 Despite its conserved role on gene expression and transposable element (TE) silencing, genome-wide CG methylation differs substantially between wild Arabidopsis thaliana accessions. Results To test our hypothesis that global reduction of CG methylation would reduce epigenomic, transcriptomic, and phenotypic diversity in A. thaliana accessions, we knock out MET1, which is required for CG methylation, in 18 early-flowering accessions. Homozygous met1 mutants in all accessions suffer from common developmental defects such as dwarfism and delayed flowering, in addition to accession-specific abnormalities in rosette leaf architecture, silique morphology, and fertility. Integrated analysis of genome-wide methylation, chromatin accessibility, and transcriptomes confirms that MET1 inactivation greatly reduces CG methylation and alters chromatin accessibility at thousands of loci. While the effects on TE activation are similarly drastic in all accessions, the quantitative effects on non-TE genes vary greatly. The global expression profiles of accessions become considerably more divergent from each other after genome-wide removal of CG methylation, although a few genes with diverse expression profiles across wild-type accessions tend to become more similar in mutants. Most differentially expressed genes do not exhibit altered chromatin accessibility or CG methylation in cis, suggesting that absence of MET1 can have profound indirect effects on gene expression and that these effects vary substantially between accessions. Conclusions Systematic analysis of MET1 requirement in different A. thaliana accessions reveals a dual role for CG methylation: for many genes, CG methylation appears to canalize expression levels, with methylation masking regulatory divergence. However, for a smaller subset of genes, CG methylation increases expression diversity beyond genetically encoded differences. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-022-02833-5.


Feature intersections between DEGs, CG-DMRs and HV-dACRs
To intersect positions of each consensus DEG set (TE-DEG/Non-TE-DEG) with CG-DMRs, we identified the five closest CG-DMRs to each DEG. Based on their proximity to the DEG, each of these hits were classified as 'extended gene-body' (distances within 100bp upstream of the TSS and 100bp downstream of the TTS), 'cis upstream/downstream' (within 1.5kb upstream/downstream of the gene body) and 'trans' if not falling within the above classes. All DEGs with CG-DMR hits under the 'trans' category were filtered out. DEGs with unique or multiple 'gene-body' CG-DMRs were classified as 'GB' , while DEGs with unique/multiple CG-DMRs in gene-body and cis, or only in cis were classified as 'cis'.
Subsequently, gene expression counts and methylation counts for the corresponding CG-DMRs were extracted for these DEGs. Since many DEGs had multiple CG-DMRs associated with them, we only retained CG-DMRs which showed the highest difference in methylation level between mutant and WT for each mutant genotype, thereby aiming to represent only the strongest methylation signals that could explain gene expression differences.
For intersecting consensus DEGs with HV-dACRs, we used a similar approach, where the closest five HV-dACRs to each DEG were identified. After filtering out DEGs with HV-dACRs in 'trans', we retained all remaining DEGs (even with multiple 'gene-body' and 'cis' HV-dACRs) in a single category called 'cis'. For each DEG, only HV-dACRs with the highest difference in accessibility between mutant and WT for each mutant genotype were retained.
A more detailed explanation of the above methods is provided below :

Intersects between DEGs and CG-DMRs:
Consensus DEGs (from all accessions) were first split as 5731 Non-TE-DEGs and 1401 TE-DEGs. Next, each set was intersected with positions of consensus CG-DMRs (generated from all samples), using bedtools closest, to find the top 5 closest CG-DMRs to each DEG.
#Example of command used (to find 5 closest hits of CG-DMRs to DEGs) bedtools closest -a Consensus_NonTE-DEGs_coord.tab -b CG_DMRs_coord.bed -D a -k 5 > Consensus_NonTEDEGs_closestk5_CGDMRs.bed All CG-DMR hits under the 'trans' category were filtered out. Next, the number of duplicate CG-DMR hits for each gene were counted. Each gene was further classified into the following categories: 'GB unique' (single CG-DMR hit in gene body), 'GB multi' (multiple CG-DMR hits in gene body), 'cis upstream' (single CG-DMR hit in cis upstream), 'cis downstream' (single CG-DMR hit in cis downstream), 'cis multi' (multiple cis CG-DMRs) and 'GB and cis' (multiple CG-DMRs in gene body and cis).
Genes classified as 'GB unique' and 'GB multi' were pooled together in a 'GB all' category.
All remaining categories were pooled as a 'multiple cis regulatory' category. Gene expression counts were obtained for genes in both pooled categories (for 158 samples and reduced to 73 samples to match the BS-seq dataset). Similarly, methylation levels for the CG-DMR corresponding to each gene were also obtained for 73 samples.
Next, each gene was scanned to examine duplicate CG-DMR hits and their methylation values. For each of the 73 samples, the CG-DMR carrying the highest difference between wild-type and mutant methylation levels was retained. These genes were subsequently plotted based on their methylation differences and gene expression differences.

Consensus_NonTEDEGs_closestk5_HV-dACRs.bed
All HV-dACR hits under the 'trans' category were filtered out. All other genes and corresponding HV-dACR hits were pooled as a 'multiple cis regulatory' category. Gene expression counts were obtained for genes in each category (for 158 ATAC-seq samples and reduced to 73 samples to match the methylation dataset. Similarly, ATAC-seq accessibility levels (TMM) for the corresponding HV-dACR hit in each gene was also obtained for 158 samples and reduced to 73 samples.
Next, each gene was scanned to examine duplicate HV-dACR hits and their accessibility values. For each of the 73 samples, the HV-dACR carrying the highest difference between wild-type and mutant accessibility levels was retained. These genes were subsequently plotted based on these accessibility differences and gene expression differences.