The role of longitudinal cohort studies in epigenetic epidemiology: challenges and opportunities

Longitudinal cohort studies are ideal for investigating how epigenetic patterns change over time and relate to changing exposure patterns and the development of disease. We highlight the challenges and opportunities in this approach.


Introduction
Interest in the role of epigenetic processes in common complex diseases continues to increase [1,2]. Epigenetics is a potentially major mechanism by which environmental factors can affect physiological function and disease risk. Research into epigenetics promises to reveal many of the causes that remain undiscovered after extensive investigation of common genetic variation [3].
Epidemiological approaches can be used to identify whether epigenetic processes are involved in mediating the association between risk factors (environmental, genetic, lifestyle, socioeconomic and so on) and common complex disease [4,5]. For example, longitudinal cohort studies have been a cornerstone of observational epidemiology for many years. Long-term follow-up of adult cohorts has identified important risk factors for cardiovascular disease, chronic bronchitis, and cancers, and follow-up of cohorts from birth or childhood has been equally successful at identifying the importance of early exposures (especially the childhood social environment) and developmental characteristics for adult health (for example, [6-10]). Longitudinal studies, particularly those that start in early life, can contribute to our understanding of how the epigenome changes over time, as a result of varying environmental exposures, and how disease phenotypes evolve. Longitudinal studies are costly to instigate and maintain, and cross-sectional studies (a less expensive alternative study design) have more often been used to assess the relationship between exposures and the epigenome and/or the epigenome and disease. However, cross-sectional studies cannot capture the dynamic nature of epigenetic mechanisms [11], making it difficult to identify the influences of the environment and/or disease state (or subclinical features of disease) on the epigenome and thus establish the direction of causality. As a result, study designs that make use of multiple time points are increasingly recognized as the most suitable for analyzing the epigenetics of common complex diseases. Because longitudinal studies track the same cohort at multiple time points throughout their lifetime, enabling the temporal relationship between exposure and disease to be established, they are ideally placed for exploitation in epigenetic investigations.
Advances in genomic technologies have opened up the possibility of large-scale population-based assessment of epigenetic patterns to help understand their influence on disease. How should such studies be conducted to maximize their impact and what can epigenetics researchers learn from previous approaches to population-based studies? Here we focus on how epidemiological approaches, including the design of cohort studies, can help investigate the role of epigenetic variation in common complex disease. Furthermore, the dynamic nature of epigenetic patterns means that they can be altered by disease-related factors (a process called 'reverse causation') as well as a host of confounding factors (such as age, sex, socioeconomic position, diet, or smoking). Many relevant approaches have been developed in the context of both genetic and life course epidemiology that could be fruitfully applied to epigenetics; examples are methods for dealing with biases, confounding, and reverse causation and also longitudinal statistical modeling techniques [12,13]. We first assess what epigenetic markers have been measured within existing life course studies.

Epigenetic studies within longitudinal cohorts
Since 2010, 34 life course studies have included measurements of DNA methylation, and just four of these have included analysis of epigenetic features at more than one time point (Table 1). In line with the vast majority of other epigenetic studies, the focus is on DNA methylation as this is the most straightforward form of epigenetic modification to measure, and the only currently feasible option in archived DNA samples. Prospective sample collection will permit the analysis of chromatin modifications and microRNA.
Three of the studies analyzing more than one time point (Table 1) report findings relating specifically to age-related changes in childhood [14] or adulthood [15,16], and all three focus on gene-specific DNA methylation of a small panel of (different) loci and report differences that were modest in size (generally <5%). A further study considers changes in DNA methylation over a relatively short time period (28 to 180 days) in relation to air pollution exposure [17]. Although there was some indication of lower global DNA methylation in repetitive elements across the genome in this study [17] at 90 days of exposure, there was no evidence of a dose-response relationship, casting doubt on the biological importance of this association. In summary, very little has been done in this area. Table 2 summarizes additional examples in which case-control studies of DNA methylation have been nested within existing large-scale longitudinal cohorts; this approach has been applied so far exclusively in the context of cancer. Analyses in this instance have been limited to gene panels (generally established tumor suppressor genes or oncogenes) and have been undertaken either (i) to assess the utility of epigenetic signatures as early biomarkers of cancer risk [18-20] or (ii) to consider the determinants of a perturbed methylation state (methylator phenotype), which has been implicated in numerous cancers [21-25]. With improved knowledge of methylation variable regions associated with diseases other than cancer (for example, cardiovascular disease, dementia, and rheumatoid arthritis), the same approach could be adopted in the context of longitudinal cohort studies.
The paucity of DNA methylation measurements undertaken in cohorts that have collected serial samples from the same individuals is clear, indicating that the potential richness of longitudinal data and sampling in these studies has yet to be fully exploited. Few studies have routinely collected serial samples from the same individuals at multiple points in the life course (for example, the Avon Longitudinal Study of Parents and Children (ALSPAC) [26,27], and the Normative Aging Study [17,28-32]), but others are planning serial sampling in light of the interest in epigenetics (such as the Medical Research Council National Survey of Health and Development [33] and the Southall And Brent REvisited (SABRE) cohort [34]). Given the temporal variation in epigenetic patterns, serial sampling of any longitudinal cohort would be advised where possible.
Of the studies published so far, the variety of tissues analyzed is limited mainly to easily accessible peripheral blood, cord blood or buccal cells, the studies are modest in size compared with those used for genetic research, and the range of different methods that have been used to quantify DNA methylation has led to an overall lack of comparability between studies. It is clear from these observations that more can be done with respect to the collection and analysis of biological samples from longitudinal cohorts so that they are optimal for epigenetic studies.

Attributes of longitudinal cohort studies
Ideally, longitudinal epigenetic studies should include extensive, prospectively collected data and biological samples at multiple time points across the life course. Many existing longitudinal cohort studies are population-based, although some focus on a specific subgroup of the general population. For example, the SABRE cohort focuses on groups that are first- or second-generation migrants to the UK of non-European ethnicity to examine particular health issues, in this case the marked discordance in disease risk observed in migrant groups compared with Europeans living in the UK [34]. Longitudinal epigenetic studies can add value to existing resources, such as data from genome-wide association studies (for example, ALSPAC [26,27] and the Relationship between Insulin Sensitivity and Cardiovascular disease (RISC) cohort [35]). Exposures commonly captured in longitudinal studies include lifestyle factors, such as smoking, alcohol intake, diet, and physical activity patterns, and also socioeconomic measures across the life course. Common phenotypes on which longitudinal studies tend to focus include physical and anthropometric measures, cognitive, cardiovascular, metabolic, respiratory, and musculoskeletal function, and a range of blood-based intermediate biomarkers. Of particular value are birth cohorts with transgenerational and across-life samples from birth onwards, allowing an appraisal of epigenetic changes associated with in utero and early life exposures, a period when the epigenome is believed to be particularly plastic.

Applying principles of life course epidemiology to epigenetic research
Research in life course epidemiology investigates developmental, aging, and risk factor trajectories and how dynamic relationships unfold over time, and takes into account potential confounding, mediating, or interactive effects of lifetime biological, psychological, and social risk factors [36]. This conceptual framework is relevant for epigeneticists investigating long-term associations that may be biased, confounded or due to reverse causation. Life course epidemiologists have investigated various methods for modeling risk factor trajectories (particularly growth trajectories) in relation to later health outcomes and have developed a novel structured approach [37] to distinguish critical, sensitive, and accumulation life course models [38]. They use a range of approaches for modeling repeat continuous and binary outcome measures, such as generalized estimating equations or mixed models that consider correlated data such as repeat measures from the same individuals over time, and for modeling time to an event, such as survival and event history analysis. This toolkit is relevant to epigeneticists, whether studying lifetime environmental exposures that promote particular epigenetic signatures over time or how these signatures themselves may affect not just the level (intercept) of function (such as blood pressure) at a point in time but also its rate of change (slope) over time. Such statistical approaches have not been widely applied to epigenetic data, although examples can be found in Madrigano et al. [16,17], who illustrate the use of mixed models to analyze changes in methylation over time while accounting for the correlation among measurements within the same individual. Further discussion of this subject is provided below in the section on data analysis considerations.
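To make the distinction between level (intercept) and rate of change (slope) concrete, the following is a minimal sketch, not the mixed-model analysis used in the cited studies. It fits a separate least-squares line to each individual's repeat measures, a simple two-stage summary that a mixed model would replace by estimating all individuals jointly with random intercepts and slopes. The function names, the cohort dictionary, and all values are hypothetical.

```python
from statistics import mean


def fit_line(times, values):
    """Least-squares intercept and slope for one individual's repeat measures."""
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two paired observations")
    t_bar, v_bar = mean(times), mean(values)
    sxx = sum((t - t_bar) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("time points must not all be identical")
    sxy = sum((t - t_bar) * (v - v_bar) for t, v in zip(times, values))
    slope = sxy / sxx
    intercept = v_bar - slope * t_bar
    return intercept, slope


def individual_trajectories(records):
    """Map each individual ID to an (intercept, slope) pair.

    records: dict of ID -> list of (age, methylation percentage) pairs.
    """
    trajectories = {}
    for person_id, observations in records.items():
        times = [age for age, _ in observations]
        values = [level for _, level in observations]
        trajectories[person_id] = fit_line(times, values)
    return trajectories


# Hypothetical repeat measures (age in years, methylation in %).
cohort = {
    "A": [(40, 80.0), (50, 79.0), (60, 78.0)],  # methylation declines with age
    "B": [(40, 75.0), (50, 75.0), (60, 75.0)],  # stable over time
}
trajectories = individual_trajectories(cohort)
for person_id, (intercept, slope) in trajectories.items():
    print(person_id, round(intercept, 3), round(slope, 3))
```

The per-individual slopes could then be related to exposures or later outcomes. Unlike a mixed model, this two-stage summary ignores the differing precision of each individual's estimates, so it is best read as an illustration of the quantities involved.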
Several research collaborations involving cohort studies, such as HALCyon (Healthy Aging across the Life Course) [39], FALCon (Function Across the Life Course) [40] and GEoCoDE (Genomic and Epigenomic Complex Disease Epidemiology) [41], have been formed. These have increased the sample size and power to investigate lifetime risk factors on longitudinal phenotypes and to test whether findings are replicated across cohorts in a systematic way, and they will be useful to epigenetics research. The collaborations have developed experience in data harmonization to derive comparable phenotypes across the cohorts, and in cross-cohort methods (for example, [42]). Those running epigenetic studies may want to make use of these collaborations for similar reasons, and a coordinated approach is likely to advance the science and be appealing to funders. Coordinating the cohorts has led to more effective ways of gaining knowledge of the various datasets and metadata as well as facilitating data sharing and encouraging good practice in data management.

From genetic to epigenetic epidemiology
Incorporating epigenetic measures into epidemiological studies is often done in the context of genetic epidemiology resources. However, studying epigenetic factors, which are, partly at least, phenotypic, is more similar to conventional epidemiology than it is to genetic epidemiology. Several aspects of germline genetic variation lead to special-case conditions that allow relaxation of usual epidemiological principles: reverse causation (disease influencing the variable being measured rather than vice versa) is clearly not an issue in genetic epidemiology, and confounding, which often vitiates conventional epidemiology, generally relates only to ancestry in genetic epidemiology [43], and this can be accounted for by using principal components from genome-wide data as control variables. Germline genetic variation can be assessed on samples taken at any stage of life, does not change over time, and can be assayed with high precision and low measurement error. Effect sizes for the influence of common genetic variants on common complex diseases tend to be small, which means that very large sample sizes are required. Given these circumstances, the genetic epidemiology study design of choice became large case-control studies, with the controls not being carefully selected to represent the source population and sometimes (as in the case of the landmark Wellcome Trust Case Control Consortium (WTCCC) [44]) control groups shared for comparison with several disease groups. For example, in the WTCCC the common control groups consisted of blood donors (who are very unrepresentative in terms of factors that would be important confounders in conventional epidemiological studies, such as health-related behaviors and social class) and participants in the 1958 birth cohort, all of the same age, which in some cases barely overlapped with the age of the cases.
However, such study designs are not appropriate for epigenetic epidemiology, as confounding, bias, and reverse causation are all serious problems when studying phenotypic exposures. It is important that the successes of genetic epidemiology are not translated into failures for epigenetic epidemiology [1,5,45]. Prospective studies are the ideal type of study, including documented exposure (epigenetic) measures collected before the outcomes and temporal changes, detailed assessment of confounding factors, and consideration of measurement error. Currently, the effect sizes of associations in epigenetic studies are poorly delineated, but it is likely that, unlike the situation in the early days of molecular genetic epidemiology, the problem will not be one of relatively few robust associations, but rather many real observational associations will exist and the issue will be the separation of causal associations from those generated by confounding and bias. Various methods that have been developed to strengthen causal inference in conventional epidemiology, including collaborative analysis of multiple cohorts in which confounding structures differ [46], comparisons of plausible and implausible associations [47,48], and the use of instrumental variables [47], can be applied to epigenetic epidemiology studies.
An instrumental variables method that uses germline genetic variants as the instruments (Mendelian randomization) is increasingly used to strengthen causal inference with respect to environmentally modifiable exposures for which genetic variants can serve as proxy measures [49-51]. Mendelian randomization can be extended to the investigation of epigenetic profiles as the potentially modifiable exposure. This method, 'two-step epigenetic Mendelian randomization', is currently under development, and details can be found elsewhere [5,52].
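The arithmetic behind a single-instrument analysis can be sketched with the standard Wald ratio: the gene-outcome association divided by the gene-exposure association. The example below is an illustration under stated assumptions rather than the method described in [5,52]. It assumes one valid genetic instrument per step, uses a first-order standard error that ignores uncertainty in the denominator, and applies the ratio twice to mimic one reading of a two-step design (exposure to methylation, then methylation to outcome). All function names and summary statistics are hypothetical.

```python
def wald_ratio(beta_gene_outcome, se_gene_outcome, beta_gene_exposure):
    """Wald ratio instrumental-variable estimate and approximate standard error.

    The standard error is a first-order approximation that treats the
    gene-exposure association as known without error (an assumption).
    """
    if beta_gene_exposure == 0:
        raise ValueError("instrument is not associated with the exposure")
    estimate = beta_gene_outcome / beta_gene_exposure
    std_error = abs(se_gene_outcome / beta_gene_exposure)
    return estimate, std_error


# Step 1 (hypothetical numbers): a genetic proxy for an environmental exposure
# is used to estimate the effect of that exposure on methylation at a locus.
step1 = wald_ratio(beta_gene_outcome=0.6, se_gene_outcome=0.2,
                   beta_gene_exposure=0.3)

# Step 2 (hypothetical numbers): a different variant, acting as a proxy for
# methylation at the locus, is used to estimate the effect on a disease trait.
step2 = wald_ratio(beta_gene_outcome=0.05, se_gene_outcome=0.02,
                   beta_gene_exposure=0.5)

print("exposure -> methylation:", step1)
print("methylation -> outcome:", step2)
```

In practice each step depends on the usual instrumental variable assumptions (the variant is associated with the proxied trait, is independent of confounders, and affects the outcome only through that trait), none of which this sketch checks.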
A further complexity of epigenetic studies is the tissue-specific nature of epigenetic patterns. Given that they are integrally involved in the process of cell and tissue differentiation, it is no surprise that epigenetic patterns differ between tissue sources. Genetic comparisons within and between studies can be made using a variety of sources of DNA to generate genotype data; however, this is not the case in an epigenetic context. Population-based studies often have to rely on easily accessible DNA sources (such as blood, saliva, buccal cells; Table 1). These serve as a surrogate for the target tissue involved in the disease of interest, but there is inevitable heterogeneity in both the specific cell types represented and sample processing, which may bias epigenetic measurement (see the section below on data analysis considerations). Despite these limitations, epigenetic epidemiological studies are emerging and include strategies such as Mendelian randomization approaches [53] or inter-tissue comparisons [15] to interrogate the functional relevance and causal nature of observations.

Inter-generational epigenetic studies
Family-based sampling of both siblings and multiple generations can have particular value in epigenetic studies. The fact that epigenetic states are often established in early (in particular antenatal) development makes birth cohorts with recruitment and sample collection from pregnant women and sample collection on offspring from birth onwards of particular value [26,27]. There is considerable interest in the role of epigenetic mechanisms in the developmental origins of adult disease, to which longitudinal cohort studies are making a valuable contribution [4,53-59].

Data analysis considerations
Most research undertaking longitudinal analysis of molecular biomarker data assumes that there are predictable biological changes over time associated with a given exposure or disease process. However, in the context of epigenetic studies, change over time can be due to technical [60] or genetic factors [61], tissue type [62,63], changes with normal aging, and stochastic changes [64].
These sources of data 'noise' threaten the detection of the biological signal of interest. Thus, as is often the case, the first and most critical step to performing longitudinal DNA methylation analysis is careful study design and data collection with meticulous recording of technical factors and factors that vary between people. Given that data collection may occur months, years or even decades apart, the awareness and/or control of such sources of variability are paramount to making valid conclusions regarding within-individual changes over time as it may be impossible to account for these factors after the fact. Preprocessing of data is often necessary to generate comparable data from samples between and within individuals over time. International initiatives to address and reach consensus on such issues are in progress [65]. Equally important is that many of these methods seek to optimize the signal-to-noise ratio. These two considerations are critical to generating valid and reproducible results. Prudent use of preprocessing that matches the study design and data, and experimentation with several different methods are strongly encouraged. In addition, the threat of time-varying artifacts masquerading as biological signal is constantly present in longitudinal studies. This possibility should be formally tested as an automatic addition to the primary study hypothesis.
An example of a 'noise' source that is just beginning to be understood is the role of genetic factors in determining the degree of variability in DNA methylation over time. This is suggested by familial clustering of DNA methylation variability over time [61]. From the perspective of individual loci, there is also evidence of CpG site-dependent differential stability [15]. This indicates that loci should be carefully selected that demonstrate greater inter- than intra-individual variation over time. The mechanisms underlying this are unknown but could reasonably be related to overlying genetic architecture (for example, interaction with other epigenetic marks and possibly even the DNA itself) or the cellular milieu, as suggested by tissue-specific difference in stability in the same loci [63]. With the success of next-generation sequencing and its falling costs, we can look forward to a clearer view of the effect of genetic factors on DNA methylation and time-dependent variability.
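One conventional way to quantify whether a locus shows greater inter- than intra-individual variation is the intraclass correlation coefficient (ICC). The sketch below computes the one-way random-effects ICC from repeat measures at a single locus. It is an illustration rather than the procedure used in the cited studies, it assumes the same number of repeat measures for every individual, and the function name and data are hypothetical.

```python
def intraclass_correlation(measurements):
    """One-way random-effects ICC for a single locus.

    measurements: list with one entry per individual, each a list of k repeat
    methylation values (same k for everyone; an assumption of this sketch).
    Values near 1 mean variation is mostly between individuals; values near
    or below 0 mean within-individual variation dominates.
    """
    n = len(measurements)
    if n < 2:
        raise ValueError("need at least two individuals")
    k = len(measurements[0])
    if k < 2 or any(len(m) != k for m in measurements):
        raise ValueError("each individual needs the same number (>= 2) of measures")

    grand_mean = sum(sum(m) for m in measurements) / (n * k)
    person_means = [sum(m) / k for m in measurements]

    # Between-individual and within-individual mean squares (one-way ANOVA).
    ms_between = k * sum((mu - grand_mean) ** 2 for mu in person_means) / (n - 1)
    ms_within = sum(
        (x - mu) ** 2 for m, mu in zip(measurements, person_means) for x in m
    ) / (n * (k - 1))

    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("no variation in the data")
    return (ms_between - ms_within) / denominator


# Hypothetical methylation percentages at three time points per person.
stable_locus = [[10.0, 10.0, 10.0], [20.0, 20.0, 20.0], [30.0, 30.0, 30.0]]
unstable_locus = [[10.0, 20.0, 30.0], [10.0, 20.0, 30.0], [10.0, 20.0, 30.0]]

icc_stable = intraclass_correlation(stable_locus)
icc_unstable = intraclass_correlation(unstable_locus)
print(icc_stable, icc_unstable)
```

A locus like the first, where individuals differ from each other but are stable over time, would be a natural candidate under the selection criterion described above; the threshold for "greater" is a study-specific choice that this sketch does not prescribe.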
As alluded to earlier, the vast majority of longitudinal cohort studies that are in a position to consider including epigenetic assessment have used biological specimens collected from peripheral blood. Reliance on leukocyte DNA extracted from peripheral blood introduces a potential source of measurement error [66]. Given the labile nature of leukocyte subtype populations over time, this variation may make an important contribution to intra-individual changes in DNA methylation. For instance, shifts in leukocyte populations can occur as a result of normal development and aging, inflammation from infectious, rheumatological, or oncological diseases, or normal response to medications (such as non-steroidal anti-inflammatory drugs). The most definitive solution is to isolate cell types (for example, through magnetic-activated or fluorescence-activated cell sorting), so as to perform comparisons within relatively homogeneous leukocyte populations. However, this is possible only with freshly collected samples; one of the advantages of prospective longitudinal studies is the potential to collect appropriate samples relevant for epigenetic studies.
When analysis of relatively homogeneous cell types was unavailable, Zhu and colleagues [67] used total and differential leukocyte count (from a sample drawn concurrent with the methylation sample) to control for this variation in regression models. These researchers found that the proportion of leukocyte cell types correlated with levels of LINE-1 methylation. Importantly though, statistical adjustment for this did not alter the association between LINE-1 and Alu methylation levels and individual characteristics (age, gender, smoking habits, alcohol intake, and body mass index). Candidate gene studies of methylation have reached similar conclusions [15,16]. This could mean that leukocyte populations contribute a negligible amount of variance relative to the specified model factors. Alternatively, it may be that controlling for leukocyte population in this manner inadequately captures the effect of this noise. The possibility that using the direct measure of an unwanted variable in a regression equation may suboptimally reduce noise was explored by Teschendorff and colleagues [60]. Using Illumina HumanMethylation27 BeadChip data, they proposed a variation of surrogate variable analysis in which confounders are modeled as statistically independent components. Using these components instead of the original measures in regression analysis, they found a stronger association between methylation of Polycomb family gene loci and their phenotype of interest, age. From this, they concluded that the effect of confounders on the DNA methylation data was better represented by independent components than the original covariates.
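The logic of adjusting for a measured cell-count covariate can be shown with a minimal sketch. It is not the model fitted by Zhu and colleagues. It handles a single covariate, assumes linear effects, and obtains the adjusted coefficient by residualizing both the methylation outcome and the predictor on the covariate (the Frisch-Waugh-Lovell result for ordinary least squares). The function names and the toy data, constructed so that cell composition confounds the association, are hypothetical.

```python
from statistics import mean


def simple_slope(y, x):
    """Slope from a simple least-squares regression of y on x (with intercept)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("predictor has no variation")
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx


def residualize(y, x):
    """Residuals of y after removing its linear dependence on x."""
    slope = simple_slope(y, x)
    x_bar, y_bar = mean(x), mean(y)
    return [(yi - y_bar) - slope * (xi - x_bar) for xi, yi in zip(x, y)]


def adjusted_slope(outcome, predictor, covariate):
    """Coefficient of predictor on outcome, adjusted for one covariate."""
    outcome_resid = residualize(outcome, covariate)
    predictor_resid = residualize(predictor, covariate)
    denominator = sum(r * r for r in predictor_resid)
    if denominator == 0:
        raise ValueError("predictor is fully explained by the covariate")
    return sum(a * b for a, b in zip(outcome_resid, predictor_resid)) / denominator


# Hypothetical toy data: methylation depends on both the exposure and a
# leukocyte measure, and the exposure is correlated with that measure.
cell_measure = [0.0, 1.0, 2.0, 3.0]
exposure = [0.0, 1.0, 0.0, 1.0]
methylation = [2 * e + 3 * c for e, c in zip(exposure, cell_measure)]

crude = simple_slope(methylation, exposure)
adjusted = adjusted_slope(methylation, exposure, cell_measure)
print(crude, adjusted)
```

In this constructed example the crude and adjusted coefficients differ because cell composition was built in as a confounder; in the study described above, adjustment left the associations essentially unchanged.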
Lastly, in cases where no information on cell counts is available, a potential solution may arise from the DNA methylation data itself. Such a possibility is presented by Houseman and colleagues through their software methylSpectrum [68]. The authors propose an algorithm to infer the contribution of different leukocyte subpopulations to whole blood DNA methylation patterns. This software is not designed to examine changes over time and requires a suitable reference sample from which to make inferences, which would reasonably require multiple age-appropriate references in a longitudinal study setting.
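The general idea of inferring cell composition from reference profiles can be illustrated with a deliberately simplified sketch. This is not the published algorithm or the interface of the software named above. It assumes only two cell types, treats whole-blood methylation at each CpG as a weighted average of the two reference profiles, and estimates the weight by least squares constrained to lie between 0 and 1. All names and values are hypothetical.

```python
def estimate_proportion(mixed, ref_a, ref_b):
    """Least-squares estimate of the fraction of cell type A in a mixture.

    Model at each CpG: mixed = p * ref_a + (1 - p) * ref_b, with 0 <= p <= 1.
    All three arguments are methylation fractions at the same CpG sites.
    """
    if not mixed or not (len(mixed) == len(ref_a) == len(ref_b)):
        raise ValueError("profiles must be non-empty and of equal length")
    diff = [a - b for a, b in zip(ref_a, ref_b)]
    denominator = sum(d * d for d in diff)
    if denominator == 0:
        raise ValueError("reference profiles are identical at these CpGs")
    p = sum((m - b) * d for m, b, d in zip(mixed, ref_b, diff)) / denominator
    # For a one-parameter quadratic, clipping gives the constrained optimum.
    return min(1.0, max(0.0, p))


# Hypothetical reference profiles at three CpGs; the third does not
# discriminate between cell types and so carries no information.
reference_type_a = [0.9, 0.1, 0.5]
reference_type_b = [0.1, 0.7, 0.5]

# Hypothetical whole-blood profile generated from 25% type A, 75% type B.
whole_blood = [0.3, 0.55, 0.5]

proportion_a = estimate_proportion(whole_blood, reference_type_a, reference_type_b)
print(proportion_a)
```

With more than two cell types the same model becomes a constrained multivariate least-squares problem, and, as noted above, the reference profiles themselves would need to be appropriate to the age of the individuals sampled.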
In summary, we need formal comparisons of these methods in heterogeneous and homogeneous samples from the same specimen. International efforts to create reference epigenomes from homogeneous cell samples will be highly beneficial [65]. However, variation due to cellular and tissue heterogeneity is just one example of the wide breadth of issues regarding noise that require detailed and systematic study.

Modeling epigenetic change over time
There are several issues that need to be considered when analyzing epigenetic change over time, such as the unit of DNA methylation change under examination (Box 1) and the analytic technique. The unit of analysis must consider several issues. For example, how is DNA methylation measured? What is the question under investigation? Is the research focused on testing site-specific changes in DNA methylation related to exposures and/or outcomes or is it seeking to explore a network of gene regulation? What type of a priori information is available? How does this information contribute to understanding of error or covariance of methylation measurements? Are individuals compared using categorical or continuous variables?
Guided by the selected unit of DNA methylation change, we now turn to examples of modeling intra-individual variation over time that is due to disease and/or environmental factors. The selection of an appropriate modeling technique has important implications for study power and calculations of statistical significance. We limit this discussion to longitudinal studies with three or more time points, as two time points can at most infer a difference rather than the nature of change. Much of this work is borrowed from other fields, particularly gene expression studies, and uses data-driven or knowledge-driven techniques, or combinations of both.
Several techniques use comparisons between two groups (such as controls versus cases) to determine differential time courses [69,70]. Some of these methods can be extended to comparisons between more than two groups (for example, [71]). An alternative to this individual-based approach is to find time course patterns that distinguish one group of individuals from another (for example, [72,73]). Methods that capitalize on other biological knowledge (such as genomes, transcriptomes, or nucleosomes) may allow us to better infer the nature of methylation in the context of how functional regulation of the genome relates to exposures or disease processes. This is especially powerful for detecting signals that are expected to be subtle but consistent among jointly regulated loci [74]. An example is longitudinal gene set analysis [75] using annotations from databases such as Gene Ontology. The parallel analysis of different sources of high-throughput data has so far only been explored in cross-sectional methylation studies but could in theory be applied to longitudinal analysis. However, such longitudinal analysis will require advanced multidimensional techniques (Box 2). These techniques require preprocessed data that are relatively free of noise. Another approach may use data reduction techniques to extract meaningful features from data noise while simultaneously considering the time-varying nature of DNA methylation. For example, group independent component analysis with temporal concatenation of microarray data would assume that there are common sites of epigenetic activity but that the course of change may be different for each individual. Most experience in this type of technique comes from the analysis of neuroimaging data, where the goal is to uncover areas of the brain that are activated similarly among individuals in an experimental group over time [76]. The translation of such ideas to molecular data, which often have far lower temporal resolution but higher 'spatial' resolution (gene loci as opposed to areas of the brain), would be a challenging but also potentially promising avenue.

The promise of epigenetic studies of longitudinal cohorts
Future longitudinal epigenetic studies will undoubtedly integrate greater levels of genomic, biologic and/or phenomic information. For example, our expanding knowledge of factors influencing chromatin architecture may soon allow the analysis of methylation marks within the context of the broader chromatin state. Examples of such data are nucleosome mapping [77], histone modifications [78], and chromosome conformation capture [79]. The influence of the underlying and overlying chromatin architecture (interaction with protein, RNA, and DNA primary and secondary 'structure' [80]) on differential locus stability over time remains to be elucidated. Analysis of DNA methylation is clearly only scratching the surface of the epigenetic information that regulates gene expression, but longitudinal cohort studies provide a tractable opportunity to contribute to our knowledge base in this area and, as our understanding of the wider epigenome improves, additional epigenetic features may also be added to such studies.
Increasingly, studies are pushing to provide a broader mechanistic picture of cellular function and regulation by juxtaposing data from two or more kinds of high-throughput data [81,82]. So far, these data are often extracted from different materials or individuals (such as DNA methylation from whole blood and RNA from cell culture). This limits interpretation of functional relevance. However, advances in biotechnology that reduce the amount of specimen required and increase automation, in conjunction with falling costs, are likely to overcome this problem. Biobanked samples, such as plasma, DNA, and RNA from longitudinal cohorts, could make a valuable contribution to developments in this area. Furthermore, the development of nested recall studies for intensive phenotyping within established cohorts will greatly enhance research opportunities in this area.
As multidimensional datasets evolve and the ability to mine the information within them improves, it will be imperative that this information is made as accessible as possible to the wider scientific community. Although it is currently possible to access some information relating epigenetic data to common genetic variation and gene expression, providing an integrative approach, this is not available at multiple time points. Longitudinal studies can offer considerable added value in these settings and profiling using a comprehensive range of high-throughput methods can be overlaid on a wealth of exposure and phenotypic data, allowing researchers to explore specific hypotheses in silico and thus helping to prioritize resources for more detailed investigations.
In summary, longitudinal cohorts can offer a great deal in the context of epigenetic epidemiology, including identification of the major determinants of epigenetic variation in populations and a better understanding of the relationship between genetic and epigenetic variation. They provide an unprecedented opportunity to increase our understanding of the dynamic nature of epigenetic patterns and how changes occur in response to a wide range of environmental, lifestyle, and behavioral factors. Population-based studies will improve our knowledge of the extent and topography of inter-individual variation in epigenetic patterns and permit assessment of effect sizes of shifts in epigenetic patterns on health-related outcomes. A wealth of statistical approaches can be borrowed and adapted from related fields and be applied to longitudinal epigenetic analysis, an area of biostatistics that is likely to grow exponentially as high-throughput datasets become increasingly multidimensional. Insights into the temporal relationship between changes in epigenetic patterns and functional and health-related outcomes that can be gleaned from longitudinal studies will assist in defining causality. This, and other epidemiological methods to strengthen causal inference, will contribute to the identification of predictive epigenetic biomarkers and modifiable targets for intervention. The ultimate goal of observational data generated in epidemiological investigations is to feed forward into clinical practice or public health. There is already evidence of translation of longitudinal biological data to clinical applications [83]. The incorporation of epigenetic biomarkers to enhance clinical tools for prediction and prognosis is beginning to emerge [5] (Table 2), and longitudinal cohorts will undoubtedly help in this domain.

Box 1: Units of DNA methylation change
• A family of genes of known biological or clinical importance (such as those previously known to show exposure-related differential methylation)
• A group of functionally related genes (for example, as identified by Gene Ontology or Kyoto Encyclopedia of Genes and Genomes (KEGG) terms)
• A network of co-regulated genes (for example, using intersection with concurrent gene expression data or from previous literature)
• Genes related by their linear proximity on the DNA strand (such as regional grouping, as done to examine differential methylation between and within individuals [70])
• Genes related to the overlying chromatin architecture (such as knowledge of nucleosome position or histone modifications)
• Genes that show similar patterns of change (for example, gene curve [71])

Box 2: Longitudinal modeling strategies for high-dimensional data
Many techniques determine differential time courses based on comparison of two groups of variables (for example, [69,70,84-86]). Multi-group comparisons are also possible: Yuan and colleagues [71] have demonstrated the utility of hidden Markov models to classify genes based upon their temporal expression patterns, an approach that takes advantage of the information contained in time course data rather than ignoring it. If no groups are present, an alternative is to group genes that show similar temporal patterns (for example, [72]). Another approach is to group genes using a priori knowledge of biological similarities and so reduce the number of multiple comparisons. Using Gene Ontology annotation to group 'functionally' related genes, Zhang et al. [75] developed a non-parametric longitudinal gene set analysis of gene expression data to detect time-exposure interaction effects. This method is suitable for unbalanced data with missing time points. It is also appropriate for heteroscedastic variance (where variance is uneven across a given data distribution) and non-normal data distributions.
Another consideration is the anticipated type of time course. If a cyclical pattern is expected (for instance, in the study of circadian rhythms or cell cycles), Li et al. [73] propose functional clustering using an autoregressive moving-average process. If the goal is to identify groups of co-expressed genes showing gradual changes over time that may be linked to disease progression, Qiu et al. [87] have developed a method to study gene expression in cancer tissue at various stages of malignant transformation, which may be applicable to epigenetic data.
Units that consider genes as groups or networks may require a transition from viewing DNA methylation data as a two-dimensional entity (such as disease group and time) to a three-dimensional one (such as disease group, gene locus and time), or even data 'blocks' with greater dimensions. The family of matrix and tensor decompositions (such as independent component analysis, canonical correlation analysis, non-negative tensor factorization, and canonical-decomposition/parallel factor analysis) used in areas such as psychometrics and chemometrics has been proposed as a powerful representation of biological multi-dimensional data [88,89]. Translation of such methods to DNA methylation is sure to follow.
Although having multiple time points is advantageous for several reasons, a complication is that similar patterns of change in any group of people can start at different times (such as onset of puberty). This may obscure detection of meaningful but overlapping patterns. This can be unraveled using methods that account for lag between individuals, such as by using parallel factor analysis-related models [90] or spline-based models [91].