Construction of an epigenetic mitotic clock (epiTOC): a mathematical model
In order to provide the rationale for the procedure of constructing epiTOC, as described in the next subsection, we here first present the salient features of the underlying mathematical model. We assume that cancer risk of a tissue type t in an individual s, which we denote as CR(s,t), is a monotonically increasing function f of the total number of stem cell divisions (per stem cell) incurred in the tissue, i.e., we assume that:
$$ CR\;\left(s,t\right)=f\left[ TNSC\left(s,t\right)\right] $$
where TNSC is the total number of stem cell divisions, which will depend on tissue type t and individual s. We assume further that TNSC can be approximated as:
$$ TNSC\left(s,t\right)=A(s)\left[IR(t) + E(s)ER(t)\right]=A(s)IR(t) + A(s)E(s)ER(t) $$
(1)
where A(s) denotes the chronological age of the individual s, IR(t) denotes the intrinsic rate of stem cell divisions per stem cell in tissue type t, E(s) is a complex non-linear (generally unknown) positively valued function representing the exposure of individual s to a cancer risk factor, and where ER(t) denotes an extrinsic rate of stem cell division associated with exposure to the cancer risk factor and which we assume may depend on tissue type t. Note that we assume that the intrinsic rate IR only depends on tissue type and that it is, therefore, independent of chronological age and individual s, so we are ignoring genetic factors which may influence the rate of stem cell division.
Motivated by previous work [4], we further assume that specific CpG sites in the genome acquire stochastic DNAm errors during cell replication and that the cumulative number of DNAm errors is a linear function of the total number of stem cell divisions per stem cell. We denote the cumulative amount of DNAm errors as “pcgtAge” in anticipation of how the corresponding CpGs will be identified (see the “Construction of an epigenetic mitotic-like clock: CpG selection” section below). Hence, we can also write a linear model of the form:
$$ pcgtAge\left(s,t\right) = \alpha (t) + \xi (t)\ TNSC\left(s,t\right) + \varepsilon $$
(2)
However, we can’t train a DNAm-based model with TNSC, since the latter depends on the exposures which are generally unknown. Instead, we could link Eqs. 1 and 2 above to train a DNAm-based model. Since IR is dependent on the tissue type, and since E(s) and ER(t) are generally unknown quantities, it is best to focus on one tissue type only and to consider a healthy population of individuals. This then allows us to assume that E(s) ≈ 0 and that IR(t) = constant, so, to a first approximation:
$$ \begin{array}{c} pcgtAge\left(s,t\right) = \alpha (t) + \xi (t)\ A(s)IR(t) + \varepsilon \\ {} = \alpha (t) + \gamma (t)\ A(s) + \varepsilon \end{array} $$
where we have absorbed the term IR into a new slope coefficient. Thus, to identify CpGs whose DNAm levels correlate with TNSC, it is justified to correlate DNAm to chronological age, as long as we use one tissue type and focus on healthy individuals. However, since many biological processes may be associated with distinct age-associated DNAm changes, it is necessary to make a selection of CpGs which are more likely to capture the DNAm aberrations caused by cell division.
Construction of an epigenetic mitotic-like clock: CpG selection
Motivated by the above mathematical model, we posited that we could identify relevant CpGs as follows: (i) identify CpG sites undergoing age-associated DNAm changes in a large DNAm dataset of healthy individuals, encompassing one tissue type only and correcting for potential changes in cell type composition; (ii) identify a subset of these that map to PCGT promoters, i.e., marked by the PRC2 complex, and which are constitutively unmethylated in a ground state of age zero (e.g., fetal tissue).
To justify (i), we reasoned that using multiple tissues, which would naturally differ in their mitotic tick rates, would only hamper or confound derivation of a mitotic clock (see above mathematical model). Correction for underlying changes in cell type composition is, however, an important potential confounder if we are to use only one tissue type. For these reasons, we used the dataset of 656 whole blood samples from Hannum et al. [28], representing one of the largest cohorts of healthy individuals which have been profiled with Illumina 450 k DNAm beadarrays, and a tissue type (blood) for which accurate correction for changes in blood cell type composition is possible [62]. We justify (ii) on grounds that a recent study has shown that DNAm changes occurring during hematopoietic ontogeny involve preferentially DNAm increases at PCGT promoters, i.e., sites marked by the PRC2 complex [26]. Thus, we reasoned that focusing on a subset of such promoter CpGs which are also constitutively unmethylated in a large set of fetal tissues [63] would provide us with the right markers to measure the rate of cell division.
In detail, using Hannum’s whole blood samples, we ran linear regressions of chronological age versus DNA methylation beta profiles adjusted for plate, sex, and estimates of blood cell subtypes. Estimates of blood cell subtypes were obtained using quadratic programming [64] and a novel blood cell subtype DNAm reference matrix (Additional file 1) constructed by integrating the Illumina 450 k DNAm data from Reinius et al. [65] with blood cell subtype-specific DNase hypersensitive site data from the NIH Epigenomics Roadmap (Teschendorff A et al: A comparison of reference-based algorithms for correcting cell-type heterogeneity in Epigenome-Wide Association Studies, submitted). Age-associated CpGs were selected at a false discovery rate threshold of <0.05. Subsequently, these age-CpGs were filtered for those mapping unambiguously to within 200 bp of a transcription start site (TSS200 probes). We note that this restriction to TSS200 probes was done in order to minimize differences in the ground state (i.e., at age zero) methylation levels between probes, which facilitates the later construction of the pcgtAge score as an average of the probes. With this restriction to TSS200 probes as well as the inherent restricted coverage of the 450 k beadarrays, we nevertheless still covered 72 % of all PCGT promoters as defined in Lee et al. [66]. The age-associated TSS200 CpGs were then divided into age-hypermethylated CpGs and age-hypomethylated ones. Age-hypermethylated CpGs were filtered further, selecting only those with absent or low (beta <0.2) methylation across 52 fetal tissue samples encompassing 11 tissue types (cord blood (GSE72867), stomach, heart, tongue, kidney, liver, brain, thymus, spleen, lung, adrenal gland [63]). These unmethylated promoter CpGs were divided further into those marked by PRC2 in human embryonic stem cells (hESCs) and those that are not, according to the annotation provided [66]. This resulted in an age-hypermethylated set of 385 CpGs, which we denote “pcgtAge”. In the case of the age-hypomethylated promoter CpGs, we selected those with a methylation level of at least 0.3 across all 52 fetal tissue samples in order to guarantee that the observed hypomethylation at these sites is genuine and of potential biological significance. This resulted in a second set of 656 CpGs, denoted “hypoAge”.
Age-correlative DNAm deviation scores, pcgtAge and hypoAge, were then calculated as the average DNAm over the respective CpG sites. Mathematically, pcgtAge for sample s in tissue type t, is calculated as:
$$ pcgtAge\left(s,t\right)=\frac{1}{n}{\displaystyle {\sum}_{c=1}^n{\beta}_{cst}} $$
By construction, this score should correlate with chronological age, and we can estimate parameters α’ and γ’ from fitting a linear model:
$$ pcgtAge\left(s,t\right)\kern0.5em =\kern0.5em \alpha (t)\kern0.5em +\kern0.5em \gamma (t)\kern0.5em A(s)+\kern0.5em \varepsilon ' $$
From this, we can now obtain an estimate of TNSC(s,t) for a sample of the same tissue type t but which is not healthy, e.g., one exposed to a cancer risk factor:
$$ TNSC\left(s,t\right)=\left[ pcgtAge\left(s,t\right)\kern0.5em \hbox{--} \kern0.5em \alpha '(t)\right]IR(t)/\kern0.5em \gamma '(t) $$
(3)
since estimates for IR(t) are available from the literature [2]. Thus, for the same tissue type t used in the training (i.e., blood) and from which the estimates α’ and γ’ were obtained, we can derive estimates of TNSC but not so for a different tissue type (e.g., lung or breast). However, the second term in Eq. 3 above is independent of the sample s; hence, one can write for the cancer risk:
$$ CR\left(s,x\right)=f\left[ TNSC\left(s,x\right)\right]=f\left[ pcgtAge\left(s,x\right)IR(t)/\ \gamma '(t)\kern0.5em \hbox{--} \kern0.5em term(t)\right] $$
and so, if pcgtAge(s
1
,x) > pcgtAge(s
2
,x), then the cancer risk (CR) of sample s
1
is also greater than that of s
2
: CR(s
1
,x) > CR(s
2,
x). Hence, a higher pcgtAge score should be indicative of a higher mitotic tick rate and indicate a higher cancer risk, which we can formally test if data for pre-cancerous lesions are available. Similarly, a lower hypoAge score would correspond to a higher cancer risk.
From Hannum et al. [28] whole blood data set, we estimated for the pcgtAge model that α’ = 0.052 and γ’ = 0.000345. From the 81 normal breast tissue samples from TCGA, we estimated α’ = 0.053 and γ’ = 0.000165, and from the combined 73 normal lung samples from the TCGA we estimated α’ = 0.021 and γ’ = 0.000588.
We note that even in the absence of an IR(t) estimate one can still estimate the relative TNSC numbers of two samples s
1
and s
2
of the same tissue type since the ratio:
$$ TNSC\left({s}_1,\ t\right)\kern0.5em /\kern0.5em TNSC\left({s}_{2, }t\right)\kern0.5em =\kern0.5em \left[ pcgtAge\left({s}_1,t\right)\ \hbox{--}\ \alpha '(t)\right]\kern0.5em /\kern0.5em \left[ pcgtAge\left({s}_2,t\right)\kern0.5em \hbox{--} \kern0.5em \alpha '(t)\right] $$
does not depend on IR(t).
Validation of the age-correlative models in blood and other normal tissue types
EpiTOC and the analogous model based on age-hypomethylated sites were tested in the large independent whole blood Illumina 450 k dataset of Liu et al. [31] using only healthy controls (over 300 samples). We also tested these two models in other normal tissue types. For this purpose, we used normal-adjacent tissue from TCGA, focusing on tissues for which there were enough normal samples and for which there had been corresponding fetal tissue used in the derivation and selection of the CpGs making up these models. This included 38 normal colon (normal-adjacent tissue to colon adenocarcinoma (COAD)), 160 normal kidney (adjacent to kidney renal cell carcinoma (KIRC)), 47 normal liver (adjacent to liver hepatocellular carcinoma (LIHC)), and 73 normal lung samples (adjacent to lung squamous cell carcinoma (LSCC)/ lung adenocarcinoma (LUAD)).
Validation of the epigenetic mitotic clock in normal tissue
In order to demonstrate that our age-correlative models define approximate mitotic clocks, i.e., that the relation TNSC(s,t) ~ pcgtAge(s,t) holds, we estimated the age-correlative scores in the normal tissue samples of TCGA [37], for which estimates of the intrinsic stem cell division rates (IR) per stem cell and per year are available from [2]. Specifically, this included colon (IR = 73 divisions per stem cell per year), rectum (IR = 73), esophagus (IR = 17.4), head and neck (IR = 21.5), liver (IR = 0.9125), lung (IR = 0.07), pancreas (IR = 1) and thyroid (IR = 0.087), encompassing a total of 288 normal tissue samples. Cumulative total number of cell divisions per stem cell (i.e., TNSC) was estimated for each of these 288 normal “healthy” samples as the product of chronological age (tissue-independent) and the corresponding rate IR (tissue-dependent), i.e., as TNSC(s,t) = A(s)IR(t). The sample-specific scores pcgtAge, nonpcgtAge, and hypoAge were then correlated to these sample-specific cumulative cellular turnover rates using a linear regression framework adjusted for chronological age (in order to avoid the expected trivial correlation by age).
Validation of the epigenetic mitotic clock in cancer tissue: construction of an mRNA expression based-mitotic index
Because cell division rates are altered in cancer, we validated the mitotic nature of the age-correlative scores pcgtAge and hypoAge in cancer samples by comparison of these scores to an mRNA expression-based mitotic index. This mitotic index was constructed by first taking the interConstruction of the epiTOC model of genes in the cell proliferation cluster of Ben-Porath et al. [67] and those of the proliferation signature of Rhodes et al. [68]. This resulted in nine genes (CDKN3, ILF2, KDELR2, RFC4, TOP2A, MCM3, KPNA2, CKS2, and CDC2). The mitotic index was then defined as the average mRNA expression over these nine genes. This mRNA expression-based mitotic index was validated in 15 cancer types of TCGA by demonstrating that it is significantly increased in each cancer type compared to its corresponding normal tissue type. We verified that it was a more reliable mitotic index than PCNA expression (not shown).
Cancer and pre-cancerous Illumina 450 k datasets
We downloaded and processed level 3 Illumina 450 k and RNA-SeqV2 data from TCGA [37], as described by us previously [69]. In total, we considered 15 cancer types: BLCA (bladder adenocarcinoma, nN = 19, nC = 204), BRCA (breast adenocarcinoma, nN = 81, nC = 652), COAD (colon adenocarcinoma, nN = 38, nC = 272), ESCA (esophageal carcinoma, nN = 15, nC = 126), HNSC (head and neck squamous cell carcinoma, nN = 45, nC = 402), KIRC (kidney renal cell carcinoma, nN = 160, nC = 299), KIRP (kidney renal papillary carcinoma, nN = 45, nC = 196), LIHC (liver hepatocellular carcinoma, nN = 47, nC = 176), LSCC (lung squamous cell carcinoma, nN = 41, nC = 275), LUAD (lung adenomacarcinoma, nN = 32, nC = 399), PAAD (pancreatic adenoma carcinoma, nN = 10, nC = 146), PRAD (prostate ademoma carcinoma, nN = 48, nC = 278), READ (rectal adenoma carcinoma, nN = 7, nC = 95), THCA (thyroid carcinoma, nN = 53, nC = 489), and UCEC (uterine cervix endometrial carcinoma, nN = 34, nC = 374).
We used Illumina 450 k DNAm data from the three previous publications and which had profiled precursor cancer lesions or normal-adjacent tissue in addition to normal samples. Briefly, these datasets were: (i) a dataset of normal lung and lung carcinoma in situ (LCIS) samples, with a subset of these progressing to an invasive lung cancer, previously described in [20]; (ii) a dataset of normal breast tissue and ductal carcinoma in situ (DCIS) samples, with a subset of these progressing to an invasive breast cancer, previously described in [40]; (iii) a dataset of 50 normal breast tissue samples, 42 matched normal-adjacent breast tumor pairs, and an additional 263 breast cancers, previously described in [41].
Age-correlative scores pcgtAge and hypoAge were estimated as average DNAm levels over the corresponding CpG sites in all of these samples.
Buccal tissue Illumina 450 k set
Illumina 450 k DNAm profiles were generated for buccal samples from 790 women, all aged 53 at the time of sampling, as described by us previously [20]. For a subset of 152 women, there were matched buccal–blood samples. We used the normalized data as used in our previous publication.
Non-TCGA cancer tissue and ENCODE cell line DNAm datasets
Illumina 450 k DNAm data for 32 glioblastoma multiformes (GBM) were downloaded from the Gene Expression Omnibus (GEO; accession number GSE30338) [70] and normalized with BMIQ. Illumina 450 k data for 215 ovarian cancers was processed and normalized with BMIQ as described by us previously (GEO: GSE74845) [71]. Illumina 27 k DNAm data for a total of 49 cervical cancer epithelial samples were processed and normalized as described by us previously (GEO: GSE30759) [15]. Cell line Illumina 450 k DNAm data for 62 cell lines was obtained from ENCODE via GEO (GSE40699). These data were subsequently normalized with BMIQ.
Purified T-cell, B-cell and monocyte Illumina 450 k sets
Illumina 450 k profiling was performed on 100 purified CD19+ B-cell samples, 98 CD4+ T-cell samples, and 104 CD14+/CD16− monocytes from a total of 52 monozygotic twins discordant for type 1 diabetes, as described by us previously (Paul D et al: Increased DNA methylation variability in type 1 diabetes across three immune effector cell types, submitted) Here we only used the samples from the healthy controls, amounting to 50 B-cell, 49 T-cell, and 52 monocyte samples. Across all cell types, the mean cell purity was 90 %. The Illumina 450 k data were processed as described (Paul D et al: Increased DNA methylation variability in type 1 diabetes across three immune effector cell types, submitted) and are available from the European Genome-phenome Archive (EGA; https://www.ebi.ac.uk/ega/) under accession number EGAS00001001598.
In addition, we used Illumina 450 k data of an independent set of purified CD4+ T-cell (n = 214) and monocyte (n = 1202) samples, as generated by the MESA study [33]. These data were downloaded from the GEO (GSE56046 and GSE56581). Intra-array normalization was performed with BMIQ.
Stem cell Illumina DNAm sets
We downloaded normalized Illumina 27 k data for two sets of stem cell-like cell populations. One set had profiled eight mesenchymal stem cell (MSC) samples (all of the same low passage number of 2) collected from the bone marrow of eight donors of widely different ages [35]. Data were obtained from the GEO (GSE17448). Another set consisted of 12 CD34+ hematopoetic progenitor cell (HPC) samples collected from either cord blood (n = 7) or mobilized peripheral blood from adults (n = 5, age range 28 to 41 years) [36]. Data were obtained from EBI’s ArrayExpress repository (E-MTAB-487).
Software availability
An R script implementing epiTOC and the associated probe IDs of the CpGs making up epiTOC is available as Additional files 5 and 6.