Sample collection and age estimation
The trees used in this study were located at Hood River Ranger District [Horse Thief Meadows area], Mt. Hood National Forest, 0.6 mi south of Nottingham Campground off OR-35 at unmarked parking area, 500′ west of East Fork Trail #650 across river, ca. 45.355313, − 121.574284 (Additional file 1: Fig. S1).
Cores were originally collected from the main stem and five branches from each of five trees in April 2015 at breast height (∼ 1.5 m) for standing tree age using a stainless-steel increment borer (5 mm in diameter and up to 28 cm in length). Cores were mounted on grooved wood trim, dried at room temperature, sanded, and stained with 1% phloroglucinol following the manufacturer’s instructions (https://www.forestry-suppliers.com/Documents/1568_msds.pdf). Annual growth rings were counted to estimate age. For cores for which accurate estimates could not be made from the 2015 collection, additional collections were made in spring 2016. However, due to difficulty in collecting by climbing, many of the cores did not reach the center of the stem or branches (pith) and/or the samples suffered from heart rot. Combined with the difficulty in demarcating rings in porous woods such as poplar Populus [57, 58], accurate measures of tree age or branch age were challenging (Additional file 1: Fig. S2).
Simultaneously with stem coring, leaf samples were collected from the tips of each of the branches from the selected five trees. Branches 9.1, 9.5, 13.4, 14.1, 15.1, and 15.5 were too damaged to determine reasonable age estimates and were removed from analysis. Branch 14.4 and the stems of 13.1 and 13.2 were estimated by simply regressing the diameter of all branches and stems that could be aged by coring.
Nuclei prep for DNA extraction
Poplar leaves that had been kept frozen at − 80 °C were gently ground with liquid nitrogen and incubated with NIB buffer (10 mM Tris-HCL, PH8.0, 10 mM EDTA PH8.0, 100 mM KCL, 0.5 M sucrose, 4 mM spermidine, 1 mM spermine) on ice for 15 min. After filtration through miracloth, Triton x-100 (Sigma) was added to tubes at a 1:20 ratio, placed on ice for 15 min, and centrifuged to collect nuclei. Nuclei were washed with NIB buffer (containing Triton x-100) and re-suspended in a small amount of NIB buffer (containing Triton x-100) then the volume of each tube was brought to 40 ml and centrifuged again. After careful removal of all liquid, 10 ml of Qiagen G2 buffer was added followed by gentle re-suspension of nuclei; then 30 ml G2 buffer with RNase A (to final concentration of 50 mg/ml) was added. Tubes were incubated at 37 °C for 30 min. Proteinase K (Invitrogen), 30 mg, was added and tubes were incubated at 50 °C for 2 h followed by centrifugation for 15 min at 8000 rpm, at 4 °C, and the liquid gently poured to a new tube. After gentle extraction with chloroform to isoamyl alcohol (24:1), then centrifugation and transfer of the top phase to a fresh tube, HMW DNA was precipitated by addition of 2/3 volume of iso-propanol and re-centrifugation to collect the DNA. After DNA was washed with 70% ethanol, it was air dried for 20 min and dissolved thoroughly in 1× TE.
Whole-genome sequencing
We sequenced Populus trichocarpa var. Stettler using a whole-genome shotgun sequencing strategy and standard sequencing protocols. Sequencing reads were collected using Illumina and PacBio. Both the Illumina and PacBio reads were sequenced at the Department of Energy (DOE) Joint Genome Institute (JGI) in Walnut Creek, California, and the HudsonAlpha Institute in Huntsville, Alabama. Illumina reads were sequenced using the Illumina HISeq platform, while the PacBio reads were sequenced using the RS platform. One 400-bp insert 2 × 150 Illumina fragment library was obtained for a total of ~ 349× coverage (Additional file 2: Table S11). Prior to assembly, all Illumina reads were screened for mitochondria, chloroplast, and phix contamination. Reads composed of > 95% simple sequence were removed. Illumina reads less than 75 bp after trimming for adapter and quality (q < 20) were removed. The final Illumina read set consists of 906,280,916 reads for a total of ~ 349× of high-quality Illumina bases. For the PacBio sequencing, a total of 69 chips (P6C4 chemistry) were sequenced with a total yield of 59.29 Gb (118.58×) with 56.2 Gb > 5 kb (Additional file 2: Table S12), and post error correction a total of 37.3 Gb (53.4×) was used in the assembly.
Genome assembly and construction of pseudomolecule chromosomes
The current release is version 1.0 release began by assembling the 37.3 Gb corrected PacBio reads (53.4× sequence coverage) using the MECAT CANU v.1.4 assembler [39] and subsequently polished using QUIVER v.2.3.3 [40]. This produced 3693 scaffolds (3693 contigs), with a scaffold N50 of 1.9 Mb, 955 scaffolds larger than 100 kb, and a total genome size of 693.8 Mb (Additional file 2: Table S13). Alternative haplotypes were identified in the initial assembly using an in-house Python pipeline, resulting in 2972 contigs (232.3 Mb) being labeled as alternative haplotypes, leaving 745 contigs (461.5 Mb) in the single haplotype assembly. A set of 64,840 unique, non-repetitive, non-overlapping 1.0-kb syntenic sequences from version 4.0 P. trichocarpa var. Nisqually assembly aligned to the MECAT CANU v.1.4 assembly and used to identify misjoins in the P. trichocarpa var. Stettler assembly. A total of 22 misjoins were identified and broken. Scaffolds were then oriented, ordered, and joined together into 19 chromosomes. In the syntenic marker FASTA file, each record identifier carried information pertaining to the Nisqually chromosome where the sequence was extracted, as well as the position in the chromosome. These markers, along with the annotated primary transcripts from Nisqually, were aligned to the Poplar var. 14.5 assembly using BLAT. The chromosome/position information was used to identify misjoins in the assembly. Once the misjoins were corrected, the scaffolds were ordered and oriented using the positional information contained in the syntenic markers/genes. A total of 117 joins were made during this process, and the chromosome joins were padded with 10,000 Ns [59]. Small adjacent alternative haplotypes were identified on the joined contig set. Alternative haplotype (Althap) regions were collapsed using the longest common substring between the two haplotypes. A total of 14 adjacent alternative haplotypes were collapsed.
The resulting assembly was then screened for contamination. Homozygous single nucleotide polymorphisms (SNPs) and insertion/deletions (InDels) were corrected in the release sequence using ~ 100× of Illumina reads (2 × 150, 400-bp insert) by aligning the reads using bwa-0.7.17 mem [60] and identifying homozygous SNPs and InDels with the GATK v3.6’s UnifiedGenotyper tool [61]. A total of 206 homozygous SNPs and 11,220 homozygous InDels were corrected in the release. Heterozygous SNP/indel phasing errors were corrected in the consensus using the 118.58× raw PacBio data [59]. A total of 66,124 (1.98%) of the heterozygous SNP/InDels were corrected. The final version 1.0 improved release contains 391.2 Mb of sequence, consisting of 25 scaffolds (128 contigs) with a contig N50 of 7.5 Mb and a total of 99.8% of assembled bases in chromosomes. Plots of the Nisqually marker placements for the 19 chromosomes are shown in Additional file 1: Fig. S9.
Genome annotation
Transcript assemblies were made from ~ 1.4 billion pairs of 2 × 150 stranded paired-end Illumina RNA-seq GeneAtlas P. trichocarpa var. Nisqually reads, ~ 1.2 billion pairs of 2 × 100 paired-end Illumina RNA-seq P. trichocarpa var. Nisqually reads from Dr. Pankaj Jaiswal, and ~ 430 M pairs of 2 × 75 stranded paired-end Illumina var. Stettler reads using PERTRAN [41] on P. trichocarpa var. Stettler genome. About ~ 3 M PacBio Iso-Seq circular consensus sequences were corrected and collapsed by genome-guided correction pipeline on P. trichocarpa var. Stettler genome to obtain ~ 0.5 million putative full-length transcripts. A total of 293,637 transcript assemblies were constructed using PASA [62] from RNA-seq transcript assemblies above. Loci were determined by transcript assembly alignments and/or EXONERATE alignments of proteins from A. thaliana, soybean, peach, Kitaake rice, Setaria viridis, tomato, cassava, grape, and Swiss-Prot proteomes to repeat-soft-masked P. trichocarpa var. Stettler genome using RepeatMasker [63] with up to 2-kb extension on both ends unless extending into another locus on the same strand. Gene models were predicted by homology-based predictors, FGENESH+ [64] FGENESH_EST (similar to FGENESH+, EST as splice site and intron input instead of protein/translated ORF), EXONERATE [65], PASA assembly ORFs (in-house homology constrained ORF finder), and from AUGUSTUS via BRAKER1 [66]. The best scored predictions for each locus are selected using multiple positive factors including EST and protein support, and one negative factor: overlap with repeats. The selected gene predictions were improved by PASA. Improvement includes adding UTRs, splicing correction, and adding alternative transcripts. PASA-improved gene model proteins were subject to protein homology analysis to the abovementioned proteomes to obtain Cscore and protein coverage. Cscore is a protein BLASTP score ratio to MBH (mutual best hit) BLASTP score, and protein coverage is the highest percentage of protein aligned to the best of homologs. PASA-improved transcripts were selected based on Cscore, protein coverage, EST coverage, and its CDS overlapping with repeats. The transcripts were selected if its Cscore is larger than or equal to 0.5 and protein coverage larger than or equal to 0.5, or it has EST coverage, but its CDS overlapping with repeats is less than 20%. For gene models whose CDS overlaps with repeats for more than 20%, its Cscore must be at least 0.9 and homology coverage at least 70% to be selected. The selected gene models were subject to Pfam analysis and gene models whose protein is more than 30% in Pfam TE domains were removed and weak gene models. Incomplete gene models, low homology supported without fully transcriptome supported gene models and short single exon (< 300-bp CDS) without protein domain nor good expression gene models, were manually filtered out.
SNP calling methods
Illumina HiSeq2500 paired-end (2 × 150) reads were mapped to the reference genome using bwa-mem [60]. Picard toolkit was used to sort and index the bam files. GATK [61] was used further to align regions around InDels. Samtools v1.9 [67] was used to create a multi-sample mpileup for each tree independently. Preliminary SNPs were called using Varscan v2.4.0 [68] with a minimum coverage of 21.
At these SNPs, for each branch, we calculated the conditional probability of each potential genotype (RR, RA, AA) given the read counts of each allele, following SeqEM [69], using an estimated sequencing error rate of 0.01. We identified high-confidence genotype calls as those with a conditional probability 10,000× greater than the probabilities of the other possible genotypes. We identified potential somatic SNPs as those with both a high-confidence homozygous and high-confidence heterozygous genotype across the branches.
We notice that the default SNP calling parameters tend to overcall homozygous-reference allele genotypes and that differences in sequencing depth can bias the relative number of heterozygous SNPs detected. To overcome these issues, we re-called genotypes using conditional probabilities using down sampled allele counts. To do this, we first randomly selected a set number of sequencing reads for each library at each potential somatic SNP so that all libraries have the same sequencing depth at all SNPs. Using the down-sampled reads, we calculate the relative conditional probability of each genotypes by dividing the conditional probabilities by the sum of the conditional probabilities of all three potential genotypes. These relative probabilities are then multiplied by the dosage assigned to their respective genotype (0 for RR, 1 for RA, 2 for AA), and the dosage genotype is the sum of these values across all 3 possible genotypes. Discrete genotypes were assigned using the following dosage values: RR = dosage < 0.1; RA = 0.9 < dosage < 1.1; AA = dosage > 1.9. Dosages outside those ranges are assigned a NA discrete genotype. SNPs with an NA discrete genotype or depth below the down sampling level in any branch of a tree were removed from further analysis. We performed three replicates of this procedure for depths of 20, 25, 30, 35, 40, and 45 reads.
PacBio libraries for each branch were sequenced using the PacBio Sequel platform, fastq files aligned to the P. trichocarpa var. Stettler14 reference genome using ngmlr [70], and multi-sample mpileup files generated using in Samtools v1.9 [67] to quantify the allele counts at the potential somatic SNPs. We used a minimum per-sample sequence depth of 20 reads and used an alternate-allele threshold of 0.1 to call a heterozygote genotype in the PacBio data.
To identify high-confidence candidate somatic SNPs, we identified potential somatic SNPs with the same genotypes across branches using both the Illumina-based PacBio-based genotypes, only including SNPs with full data in all branches for both types of genotypes. Of these, we only retained SNPs that are homozygous in a single branch or have a single homozygous-to-heterozygous transition (and no reversion) going from the lowest to highest branches.
Estimating somatic nucleotide mutation rate
Building on the analytical framework developed in van der Graaf et al. (2015) and Shahryary et al. 2019 (co-submission), we developed mutSOMA (https://github.com/jlab-code/mutSOMA), a statistical method for estimating genetic mutation rates in long-lived perennials such as trees. The method treats the tree branching structure as a pedigree of somatic lineages and uses the fact that these cell lineages carry information about the mutational history of each branch. A detailed mathematical description of the method is provided in Additional file 3: Supplementary Text. But briefly, starting from the .vcf* files from S samples representing different branches of the tree, we let Gik be the observed genotype at the kth single nucleotide (k = 1, …, N) in the ith sample, where N is the effective genome size (i.e., the total number of bases with sufficient coverage). With four possible nucleotides (A, C, T, G), Gik can have 16 possible genotypes in a diploid genome, 4 homozygous (A|A, T|T, C|C, G|G) and 12 heterozygous (A|G, A|T, …, G|C). Using this coding, we calculate the genetic divergence, D, between any two samples i and j as follows:
$$ {D}_{ij}={\sum}_{k=1}^NI\left({G}_{ik},{G}_{jk}\right){N}^{-1}, $$
where I(Gik, Gjk) is an indicator function, such that, I(Gik, Gjk) = 1 if the two samples share no alleles at locus k, 0.5 if they share one, and 0 if they share both alleles. We suppose that Dij is related to the developmental divergence time of samples i and j through a somatic mutation model MΘ. The divergence times can be calculated from the coring data (Additional file 2: Table S14). We model the genetic divergence using
$$ {D}_{ij}=c+{D}_{ij}^{\bullet}\left({M}_{\varTheta}\right)+{\epsilon}_{ij}, $$
where ϵij ∼ N(0, σ2) is the normally distributed residual, c is the intercept, and \( {D}_{ij}^{\bullet}\left({M}_{\varTheta}\right) \) is the expected divergence as a function of mutation model M with parameter vector ϴ. Parameter vector ϴ contains the unknown mutation rate δ and the unknown proportion γ heterozygote loci of the most recent common “founder” cells of samples i and j. The theoretical derivation of \( {D}_{ij}^{\bullet}\left({M}_{\varTheta}\right) \) and details regarding model estimation can be found in Additional file 3: Supplementary Text. The estimation of the residual variance in the model allows for the fact that part of the observed genetic divergence between any two samples is driven both by genotyping errors as well as by somatic genetic drift as meristematic cells pass through bottlenecks in the generation of the lateral branches.
Structural variant analysis methods
For structural variant (SV) analysis, PacBio libraries were generated for four branches from the tree 13 and four branches from tree 14 with four sequencing cells sequenced per branch using the PacBio Sequel platform. PacBio fastq files were aligned to the P. trichocarpa var. Stettler reference genome using ngmlr v.0.2.6 [70] using a value of 0.01 for the “-R” flag. SVs were discovered and called using pbsv (pbsv v2.2.0, https://github.com/PacificBiosciences/pbsv). SV signatures were identified for each sample using “pbsv discover” using the “--tandem-repeats” flag and a tandem repeat BED file generated using trf v4.09 [71] for the P. trichocarpa var. Stettler genome. SVs were called jointly for all 8 branches using “pbsv call.” The output from joint SV calling changes slightly depending on the order of the samples used for the input in “pbsv call,” so four sets of SVs were generated using four different sample orders as input. We used a custom R script to filter the SV output from pbsv [59]. We remove low-complexity insertions or deletions with sequence containing > 80% of a mononucleotide 8-mer, 50% of a single type of binucleotide 8-mer, or 60% of two types of binucleotide 8-mers. We required a minimum distance of 1 kb between SVs. We removed SVs with sequencing coverage of more than three standard deviations above the mean coverage across a sample. After calling genotypes, any SVs with missing genotype data were removed.
Genotypes were called based on the output from pbsv using a custom R script [59]. We required a minimum coverage of 10 reads in all sample and for one sample to have at least 20 reads. We required a minimum penetrance (read ratio) of 0.25 and at least 2 reads containing the minor allele for a heterozygous genotype. We allowed a maximum penetrance of 0.05 for homozygous genotypes. For each genotype, we assigned a quality score based on the binomial distribution-related relative probability of the 3 genotype classes (RR, AR, AA) based on A:R read ratio, using an estimated sequencing error of 0.032, and an estimated minimum allele penetrance of 0.35. For a genotype with a score below 0.9 but with the same genotype at the SV as another sample with a score above 0.98, the score was adjusted by multiplying by 1.67. Any genotypes with adjusted scores below 0.9 were converted to NA. For deletions, duplications, and insertions, 10 representatives in different size classes were randomly selected and the mapping patterns of reads were visually inspected using IGV v2.5.3 [72] to assign scores indicating how well the visual mapping patterns support the SV designation. Scores were defined by the following: “strong,” multiple reads align to the same locations in the reference genome that support the SV type and size; “moderate,” multiple reads align to the same reference location for one side of the SV but align to different or multiple locations in the region for the other side of the SV; and “weak,” reads align to reference locations that indicate a different SV type or much different SV size.
The percent of genic sequence and tandem repeat sequence in deletions and duplications were calculated using the P. trichocarpa var. Stettler annotation and tandem repeat BED from above, respectively. Genome-wide expectations were derived by separating the genome into 10-kb windows and calculating the percent genic and tandem repeat sequence in each window. The distribution of genic and tandem repeat sequences in deletions and duplications were compared to genome-wide expectations using the Kolmogorov-Smirnov two-sample test (one-sided, Nnull = 39,151, Ndel = 10,433, Ndup = 630).
SVs showing variation between branches and identified in all 4 replicates are potential instances of somatic SV mutations or loss-of-heterozygosity gene conversions, and the mapping positions of sequencing reads were visually inspected with IGV [72] to confirm the variation at these SVs.
MethylC-seq sequencing and analysis
A single MethylC-seq library was created for each branch from leaf tissue. Libraries were prepared according to the protocol described in Urich et al. [73]. Libraries were sequenced to 150 bp per read at the Georgia Genomics & Bioinformatics Core (GGBC) on a NextSeq500 platform (Illumina). Average sequencing depth was ~ 41.1× among samples (Additional file 2: Table S8).
MethylC-seq reads were processed and aligned using Methylpy v1.3.2 [74]. Default parameters were used except for the following: clonal reads were removed, lambda DNA was used as the unmethylated control, and binomial test was performed for all cytosines with at least three mapped reads. The methylation levels were determined using weighted methylation level, as mC / (mC + uC) where mC and uC are the number of reads supporting a methylated cytosine and unmethylated cytosine, respectively (C/C + T) [44]. The sodium bisulfite conversion rates were benchmarked against spiked in lambda DNA (which was unmethylated). All rates were well over 99% (Additional file 2: Table S8).
Identification of differentially methylated regions
Identification of differentially methylated regions (DMRs) was performed using Methylpy v1.3.2 [74]. All methylome samples were analyzed together to conduct an undirected identification of DMRs across all samples in the CNN (N = A, C, G, T) context. Default parameters were used. Regions with at least three differentially methylated cytosines (DMS) were combined into raw DMRs. DMS with different directionality (hyper vs hypo) were not combined. Only DMRs that are at least 40 bp long with five or more cytosines (three of which are differentially methylated) with at least one read were used for subsequent analysis. For each DMR, the weighted methylation level was computed as mC / (mC + uC) where mC and uC are the number of reads supporting a methylated cytosine and unmethylated cytosine, respectively [44].
To identify epigenetic variants in these samples, we used a one-sided z-test to test for a significant difference in methylation level of DMRs pairwise between branches [59]. For each pair, only DMRs with at least 5% difference in methylation level were used, regardless of underlying context. Resulting P values were adjusted using Benjamini-Hochberg correction (N = 383,600) with FDR = 0.05 [75], and DMRs are defined by adjusted P value ≤ 0.05.
Identification of methylated regions
For each sample, an unmethylated methylome was generated by setting the number of methylated reads to zero while maintaining the total number of reads. Methylpy DMR identification program [74] was applied to each sample using the original methylome and unmethylated methylome with the same parameters as used for DMR identification. Regions less than 40 bp long, fewer than three DMS, and fewer than five cytosines with at least one read were removed. Remaining regions from all samples were merged using BEDtools v2.27.1 [76].
Assigning genomic features to DMRs
A genomic feature map was created such that each base pair of the genome was assigned a single feature type (transposable element/repeat, promoter, untranslated region, coding sequence, and intron) based on the previously described annotation. Promoters were defined as 2 kb upstream of the transcription start site of protein-coding genes. At positions where multiple feature types could be applicable, such as a transposon in an intron or promoter overlapping with adjacent gene, priority was given to transposable elements, untranslated regions, introns, coding sequences, and promoters. Positions without an assignment were considered intergenic. Genomic feature content of each DMR and methylated region was assigned proportionally based on the number of bases in each category.
GO Enrichment analysis of promoter DMRs was run using topGo v2.34.0 with nodeSize = 10 and weighted Fisher’s exact for BP, CC, and MF ontologies [77]. Significance was determined for P value < 0.001.
Identification of pseudo-allele methylation
We aimed to categorize the DMRs into three pseudo-allele states: homozygous methylated, heterozygous, and homozygous unmethylated. First, DMRs were filtered on the following criteria: (i) at least 25% change in weighted CG methylation level between the highest and lowest methylation level of the samples; (ii) at least one sample had a CG methylation level of at least 75%; and (iii) at least two “covered” CG positions. A “covered” CG is defined as having at least one read for both symmetrical cytosines in all samples. After filtering, 4488 regions were used for analysis.
For each region in each sample, we next categorize the aligned reads overlapping the region [59]. If at least 35% of its “covered” CG sites are methylated, the read is categorized as methylated. Otherwise, it is an unmethylated read. Finally, we define the pseudo-allele state by the portion of methylated reads; homozygous unmethylated: ≤ 25%, heterozygous: > 25% and < 75%, and homozygous methylated: ≥ 75%.
The null distribution was created by randomly shuffling the filtered DMRs in the genome such that each simulated region is the same length as the original and it has at least two “covered” CGs. The above procedure was applied and number of epigenotype changes was determined. This was repeated for a total of 10 times.
The following special classes of DMRs were identified: highly variable, single loss, single gain, and tree specific. A DMR is highly variable if there were pseudo-allele changes between all adjacent branches. A DMR is single loss if all but one branch was homozygous methylated, and one was homozygous unmethylated. Similarly, a DMR is single gain if all but one branch was homozygous unmethylated and one branch was homozygous methylated. Finally, a DMR is “tree specific” if all tree 13 branches were homozygous unmethylated and all tree 14 branches were homozygous methylated or vice versa.
Estimating somatic epimutation rate
We previously developed a method for estimating “germline” epimutation rates in A. thaliana based on multi-generational methylation data from mutation accumulation lines [34]. In a companion method paper to the present study (Shahryary et al. 2019, co-submission), we have extended this approach to estimating somatic epimutation rates in long-lived perennials such as trees using leaf methylomes and coring data as input. This new inference method, which we call AlphaBeta, treats the tree branching structure as a pedigree of somatic lineages using the fact that these cell lineages carry information about the epimutational history of each branch. AlphaBeta is implemented as a bioconductor R package (http://bioconductor.org/packages/devel/bioc/html/AlphaBeta.html). The results detailing the significance of epimutation accumulation as described in Shahryary et al. (2019) are contained in the Additional file 2: Table S9. Using this approach, we estimate somatic epimutation rates for individual CG, CHG, and CHH sites independently, but also for regions. For the region-level analysis, we first use the differentially methylated regions (DMRs) identified above. Sampling from the distribution of DMR sizes, we then split the remainder of the genome into regions, which we refer to as “non-DMRs.” Per sample, we aggregate the total number of methylated Cs and unmethylated Cs in each region corresponding to a DMR or a non-DMR and used these counts as input for AlphaBeta.
In a replication experiment for tree 13 and tree 14, we sequenced the methylomes of leaves from seven branches sampled from the same branches (3 samples from tree 14 and 4 samples from tree 13). Initial quality control of the methylome data revealed that five of the seven samples (14–3.1, 13–1.1, 13–2.1, 13–3.1, 13–5.1) clustered well with their branch-matched replicates. However, two of the samples from tree 14 (14–4.1 and 14–5.1) revealed contamination making them unusable. Therefore, they were excluded from further analysis. Using the remaining five replicates, we re-estimated the genome-wide gain and loss rates and found that they were very similar to those obtained with the original samples (Additional file 1: Fig. S6a-d). In addition to a “complete” tree analysis (involving samples from both tree 13 and tree 14), we also examined epimutation accumulation in tree 13 alone (Additional file 1: Fig. S6e). Similar to the trends we observed with the original samples (Additional file 1: Fig. S6e-h), 5mC divergence increased as a function of age in the replicate data, although these accumulation patterns are not significant due to the small sample sizes (N = 4).
mRNA-seq sequencing and analysis
Total RNA was extracted from leaf tissue in each branch using the Direct-zol RNA MiniPrep Plus kit (Zymo Research) with Invitrogen’s Plant RNA Reagent. Total RNA quality and quantity were assessed before library construction. Strand-specific RNA-seq libraries were constructed using the TruSeq Stranded mRNA LT kit (Illumina) following the manufacturer’s instructions. For each sample, three independent libraries (technical replicates) were constructed. Libraries were sequenced to paired-end 75-bp reads at the GGBC on a NextSeq500 platform (Illumina). Summary statistics are included in the Additional file 2: Table S10.
For analysis, first, paired-end reads were trimmed using Trimmomatic v0.36 [78]. Trimming included removing TruSeq3 adapters, bases with quality score less than 10, and any reads less than 50 bp long. Second, remaining reads were mapped to the Stettler genome with HiSAT2 [79] using default parameters except to report alignments for transcript assemblers (--dta). The HiSAT2 transcriptome index was created using extracted splice sites and exons from the gene annotation as recommended. Last, transcriptional abundances for genes in the reference annotation were computed for each sample using StringTie v1.3.4d [80]. Default parameters were used except to limit estimates to reference transcripts. TPM (transcripts per million) values were outputted to represent transcriptional abundance.
Identification of differentially expressed genes
Differentially expressed genes (DEGs) were identified using DeSeq2 v1.22.2 [49]. The count matrix was extracted from StringTie output files and the analysis was performed using the protocol (ccb.jhu.edu/software/stringtie/index.shtml?t = manual#deseq). Abundances for all samples were joined into one DESeq dataset with α = 0.01. Gene abundance was compared between all samples pairwise. In each pair, a gene was considered differentially expressed if the adjusted P value ≤ 0.01 and the log2-fold change ≥ 1. Genes differentially expressed in any pair were included for subsequent analysis.
Overlap of DMRs and DEGs
We identified DMRs which overlapped the promoter region (2 kb upstream of transcription start site) and gene body of annotated genes. For each DMR-gene pair, we computed Pearson’s product moment correlation coefficient between the weighted methylation level of the DMR and average gene abundance among replicates in TPM. Next, looking only at genes which were previously identified as differently expressed, we performed two-sided Pearson’s correlation test for each DMR-DEG pair to test for statistically significant correlations. Resulting P values were multiple test corrected with Benjamini-Hochberg correction (N = 382, FDR = 0.05) [75]. Adjusted P values ≤ 0.05 were considered significantly correlated.