Generative model of methylation signal
To illustrate our model, we consider the case of estimating differences in methylation between DLPFC (D) and HF (H). We assume these brain tissues are composed of two cell types, NeuN+ (+) and NeuN- (-). For a fixed genomic position, we let μj,k be the methylation level in region j, j ∈ {H, D} and cell type k, k ∈ {+, -}. Scientifically, we are interested in identifying genomic locations where μH,k - μD,k ≠ 0, that is, where NeuN+ or NeuN- have different methylation levels in the two brain regions.
Given a sample i and considering a fixed genomic position, we define Xi as the indicator that sample i is from the hippocampus, that is, Xi = 1 if sample i is from the hippocampus and 0 otherwise. We also define πi to be the fraction of sample i that consists of NeuN- cells (1 - πi is the fraction of NeuN+ cells). We can then derive the expected value of the methylation signal of sample i at that genomic position as
Rearranging terms gives:
(1)
Suppose we wanted to estimate whether there is a difference in methylation between the two brain regions being considered, H and D. If we fit a model with terms matching those above, that is,
(M2)
then our estimated coefficients have interpretations equivalent to the generative model in Equation 1. Specifically, we can test the hypothesis of no difference in NeuN+ methylation between D and H (μH,+ - μD,+ = 0) by testing the hypothesis that β2 = 0, and the hypothesis of no difference in NeuN- methylation between D and H (μH,- - μD,- = 0) by testing the hypothesis that β3 = 0.
From the equations above, we can see that estimating the fraction of cells of each type, πi, allows us to explicitly find locations with brain-region differences specific to NeuN+ or NeuN- cells.
Naïve models are biased
In general, πi is unknown and therefore not included in the linear model, that is, the model
(M1)
is fitted. However, this model does not account for all the sources of variation in Yi, and the least squares estimate is a biased estimate of the difference in methylation between H and D under the null hypothesis. To see this, we can write , where X is the design matrix of the above model and is the vector (, ) and the hats represent least squares estimates. For simplicity, we assume equal numbers of samples from H and D. We then have
Where is the mean fraction of NeuN- cells in region j. Under the null hypothesis of no difference between D and H in either + or -, we have and also , which gives
This means that where + and - have different methylation levels (δ ≠ 0), a difference in the fractions of + and - cells in the different brain regions can lead to false-positive signals of brain region differences in methylation.
Guintivano et al. [20] estimate and propose an ad hoc approach to adjust for this that is approximated by fitting the following model
(M3)
However, this model does not account for all the sources of variation in Yi either and the least squares estimate is a biased estimate of the difference in methylation between H and D. To see this, we can write , where X is the design matrix of the above model and is the vector and the hats represent least squares estimates. For simplicity, we assume equal numbers of samples from H and D. We then have
Where K is a function of the 's that does not depend on the sample size:
With and the average of the 's in all samples and H samples, respectively, and and the average of the 's in all samples and H samples, respectively. Note that the bias is directly proportional to the difference between NeuN+ and NeuN- fractions, demonstrating that this approach is incapable of deconvolving these quantities of interest.
Estimation of mixture proportions
Although we have shown that fitting the mis-specified model, which does not include the cell-fraction terms, can lead to bias under the null hypothesis, the cell fractions for a given sample are unknown a priori. At any given methylation site, we are assuming that there is some underlying mean methylation value within each combination of cell type (+, -) and brain region (D, H). If we know these underlying means, we can derive an estimate of the unknown cell fraction at a particular site, given an observed methylation signal and assuming the generative model above. For example, suppose sample i is from D and we observe methylation signal Yi at a given locus. From Equation 1, we have
(2)
If we assume and are known, is the only unknown in this equation, so it can be estimated. Note that we do need to constrain our estimate of to be between 0 and 1. Also, the means μ are not known, so we collected data to allow us to estimate these means, by measuring methylation in pure cell sorted + or - fractions from each brain region of interest. Given that these methylation measurements have uncertainty, we want to reduce the uncertainty in our estimate of by using many informative genomic regions. We first select a set of genomic regions where + and - methylation differs. We then find the optimal value of to explain the observed methylation for sample i in these locations, as a function of our estimated means and , subject to the constraint that is between 0 and 1. This procedure closely follows that presented by Houseman et al. [19].
Selection of the genomic locations can be based on a variety of factors, such as the range of observed methylation at these locations, the variance of the methylation estimates, and the length of the region of differential methylation. For our estimation, we chose the 500 genomic regions which were the strongest + vs. - DMR candidates in the brain region of interest in relation to the amount of methylation difference and the length of the region showing the methylation difference. We found that our results were quite robust to the number of regions selected, with 500 performing well.
To investigate whether it is absolutely necessary to have sorted data from a given brain region to estimate cell proportions in unsorted data from that region, we identified a set of 'universal' genomic regions. These universal regions had different NeuN+ and NeuN- methylation signals within a brain region, but showed consistent NeuN+ and NeuN- methylation levels across the three brain regions for which we had data (DLPFC, HF, and STG). Many of these + vs. - DMR candidates had consistent NeuN+ and NeuN- levels across brain regions, with 14% to 17% of the probes in the + vs. - DMRs belonging to genomic regions of consistent signal. We estimated the means μ in these regions of consistent signal using sorted data from DLPFC alone, and then performed cell fraction estimation in the unsorted samples from DLPFC, HF, and STG using these mean values. Since we do not know the true cell fractions in these unsorted samples, we used the estimates we had obtained for each brain region using the region-specific DMRs and mean values, as described above, as our gold standard.
All analysis was implemented in R (R Core Team, R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing: Vienna, Austria, 2012; [33]). The data discussed in this publication have been deposited in NCBI's Gene Expression Omnibus and are accessible through GEO series accession number GSE48610.
Effect of inaccurate mixture estimates
As previously described, failure to account for differences in cell-mixtures in our samples can lead to biased estimates of brain-region differences under the null hypothesis of no brain region difference. However, inaccurate mixture estimates can also lead to bias. For example, consider the methylation signal in sample i
Now suppose we have an inaccurate estimate of , called , where . Using this inaccurate estimate gives us the following contribution to our regression formulation from sample i:
where ηi is between 0 and 1, and the third line follows from the fact that γi must be between and to ensure that is between 0 and 1. We can see that the coefficient of is no longer measuring just the quantity we are interested in (the difference between NeuN+ methylation in regions H and D), but it also has an additional factor related to the size of the estimation error, and similarly for the coefficient of .
CHARM DNA methylation analysis
Genomic DNA was isolated from brains using the Masterpure kit from Epicentre, according to the manufacturer's protocol. For genome-wide DNA methylation assessment, 1 ug of genomic DNA from each sample was digested, fractionated, labeled, and hybridized to a CHARM array as described [34, 35] using a custom Nimblegen 2.1 million feature array assaying 5,114,655 CpG sites. We used the Bioconductor package 'charm' for sample preprocessing along with the package 'bumphunter' for DMR identification and permutation computation.
Human postmortem brain samples
Fluorescence-activated cell sorting was performed on frozen postmortem dorsolateral prefrontal cortex (n = 4), and hippocampal formation (n = 4) and superior temporal gyrus (n = 3) from individuals not affected with neurological or psychiatric disease. To validate the statistical model, we used nine additional healthy samples from the dorsolateral prefrontal cortex. These samples underwent nuclei extraction and sorting as described below. The model was applied to additional unsorted control samples (19 samples from DLPFC, 13 samples from HF, 31 samples from STG) to deconvolve NeuN+ and NeuN- methylation signatures. All samples were obtained from the bank of the Center for Neurodegenerative Disease Research (CNDR) in the Department of Pathology and Laboratory Medicine at the University of Pennsylvania (directed by Dr John Q Trojanowski, see Additional File 1, Tables S2-4 for demographic information).
Nuclei extraction, NeuN labeling, and sorting
Total nuclei were extracted via sucrose gradient centrifugation as previously described [25]. A total of 250 mg of frozen tissue per sample was homogenized in 5 mL of lysis buffer (0.32M sucrose, 10 mM Tris pH 8.0, 5 mM CaCl2, 3 mM Mg acetate, 1 mM DTT, 0.1 mM EDTA, 0.1% Triton X-100) by douncing 50 times in a 40-mL dounce homogenizer. Lysate was transferred to a 15 mL ultracentrifugation tube and 9 mL of sucrose solution (1.8 M sucrose, 10 mM Tris pH 8.0, 3 mM Mg acetate, 1 mM DTT) was pipetted to the bottom of the tube. The solution was then centrifuged at 27,000 rpm for 2.5 h at 4C (Beckman, L8-80 M; SW28.1 rotor). After centrifugation, the supernatant was removed by aspiration and the nuclei pellet was resuspended in 500 uL of PBS.
The nuclei were incubated in a staining mix (0.71% normal goat serum, 0.036% BSA, 1:1200 anti-NeuN NeuN (Millipore, MAB377), 1:1400 Alexa647 goat anti-mouse secondary antibody (Invitrogen, 21236) for 45 min by rotating in the dark at 4°C. Unstained nuclei and nuclei stained with only secondary antibody served as negative controls. The fluorescent nuclei were run through a FACS machine with proper gate settings. A small portion of the NeuN+ and NeuN- nuclei were re-run on the FACS machine to validate the purity. Immunonegative (NeuN-) nuclei were collected in parallel. To pellet the sorted nuclei, 2 mL of sucrose solution, 50 uL of 1 M CaCl2, and 30 uL of Mg acetate were added to 10 mL of nuclei in PBS, incubated on ice for 15 min, then centrifuged at 3,000 rpm for 20 min. The nuclei pellet was resuspended in 10 mM Tris (pH 7.5), 4 mM MgCl2, and 1 mM CaCl2. Fluorescent images were taken on a Zeiss Axio Observer. Z1 microscope with a Plan-Apochromat 100x/1.40 oil-immersion objective lens. Images were generated using an Axiocam MR3 microscope camera and Axiovision software (AxioVs40, version 4.8.2.0, Carl Zeiss, Inc). Images were processed using ImageJ.