### Infinium HumanMethylation450 BeadChip

We use the following terminology, consistent with the minfi package [12]. The 450k array is available as slides of 12 arrays each, arranged in six rows and two columns. The scanner can process up to eight slides in a single run; we refer to such a set of 96 arrays as a plate. We use the standard formula *β* = *M*/(*M* + *U* + 100) for estimating the percentage methylation, given unmethylated and methylated intensities *U* and *M*.
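
As an illustration, the beta estimate can be computed directly from the two intensities. A minimal Python sketch (the methods themselves are implemented in R in minfi), using made-up intensity values:

```python
import numpy as np

def beta_values(meth, unmeth, offset=100):
    """Estimate percent methylation as beta = M / (M + U + offset).

    meth, unmeth: arrays of methylated (M) and unmethylated (U)
    intensities. The offset of 100 stabilizes the estimate when both
    intensities are small.
    """
    meth = np.asarray(meth, dtype=float)
    unmeth = np.asarray(unmeth, dtype=float)
    return meth / (meth + unmeth + offset)

# A strongly methylated locus gives beta close to 1; a weak, noisy
# locus is pulled toward 0 by the offset:
print(beta_values([9900, 50], [0, 50]))  # -> [0.99 0.25]
```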

Functional normalization uses information from the 848 control probes on the 450k array, as well as the out-of-band probes discussed in Triche *et al.* [16]. These control probes are not part of the standard output from GenomeStudio, the default Illumina software, although they can be accessed within that software through the Control Probe Profile. Instead, we use the IDAT files from the scanner together with the open-source illuminaio [49] package to access the full data from the IDAT files; this step is implemented in minfi [12].

### Control probe summaries

We transform the 848 control probes, as well as the out-of-band probes [16], into 42 summary measures. The control probes contribute 38 of these 42 measures and the out-of-band probes contribute four. An example of a control probe summary is the mapping of the 61 ‘C’ normalization probes to a single summary value, their mean. The out-of-band probes are the intensities of the type I probes measured in the opposite color channel from the probe design. For the 450k platform, this means 92,596 green intensities and 178,406 red intensities that can be used to estimate background intensity; we summarize these values into four summary measures. A full description of how the control probes and the out-of-band probes are transformed into the summary control measures is given in Additional file 1: Supplementary material.
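
As a toy illustration, collapsing one group of control probes to its mean summary measure might look like the following Python sketch; the intensity values are hypothetical, and the actual 42 measures (including the four out-of-band summaries) are defined in Additional file 1:

```python
import numpy as np

def summarize_control_group(intensities):
    """Collapse one group of control probe intensities to a single
    summary measure, here their mean (as done for the 61 'C'
    normalization probes)."""
    return float(np.mean(intensities))

# Hypothetical intensities for three of the 61 'C' normalization probes:
print(summarize_control_group([7900, 8100, 8000]))  # -> 8000.0
```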

### Functional normalization: the general framework

Functional normalization extends the idea of quantile normalization by adjusting for known covariates measuring unwanted variation. In this section, we present a general model that is not specific to methylation data. The adaptation of this general model to the 450k data is discussed in the next section. The general model is as follows. Consider high-dimensional vectors $\mathbf{Y}_1,\dots,\mathbf{Y}_n$, each associated with a set of scalar covariates $Z_{i,j}$, with $i=1,\dots,n$ indexing samples and $j=1,\dots,m$ indexing covariates. Ideally these known covariates are associated with unwanted variation and unassociated with biological variation; functional normalization attempts to remove their influence. For each high-dimensional observation $\mathbf{Y}_i$, we form the empirical quantile function of its marginal distribution, denoted $q_i^{\text{emp}}$. Quantile functions are defined on the unit interval and we use the variable $r\in[0,1]$ to evaluate them pointwise, as $q_i^{\text{emp}}(r)$. We assume the following model in pointwise form

$$q_i^{\text{emp}}(r) = \alpha(r) + \sum_{j=1}^{m} Z_{i,j}\,\beta_j(r) + \epsilon_i(r), \tag{1}$$

which has the functional form

$$q_i^{\text{emp}} = \alpha + \sum_{j=1}^{m} Z_{i,j}\,\beta_j + \epsilon_i. \tag{2}$$

The parameter function $\alpha$ is the mean of the quantile functions across all samples, the $\beta_j$ are the coefficient functions associated with the covariates and the $\epsilon_i$ are the error functions, which are assumed to be independent and centered around 0.

In this model, the term

$$\sum_{j=1}^{m} Z_{i,j}\,\beta_j \tag{3}$$

represents the variation in the quantile functions explained by the covariates. By specifying known covariates that measure unwanted variation and are not associated with a biological signal, functional normalization removes unwanted variation by regressing out this term. An example of a known covariate is processing batch; in a good experimental design, processing batch will not be associated with a biological signal.

In particular, assuming we have obtained estimates $\widehat{\beta}_j$ for $j=1,\dots,m$, we form the functionally normalized quantiles by

$$q_i^{\text{Funnorm}}(r) = q_i^{\text{emp}}(r) - \sum_{j=1}^{m} Z_{i,j}\,\widehat{\beta}_j(r). \tag{4}$$

We then transform $\mathbf{Y}_i$ into the functionally normalized quantity $\widetilde{\mathbf{Y}}_i$ using the formula

$$\widetilde{\mathbf{Y}}_i = q_i^{\text{Funnorm}}\!\left(\left(q_i^{\text{emp}}\right)^{-1}\left(\mathbf{Y}_i\right)\right). \tag{5}$$

This ensures that the marginal distribution of $\widetilde{\mathbf{Y}}_i$ has $q_i^{\text{Funnorm}}$ as its quantile function.
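
Equations (4) and (5) can be sketched numerically with linear interpolation. The following Python fragment is illustrative only (the reference implementation is the R function preprocessFunnorm in minfi) and assumes the fitted coefficient functions have already been evaluated on a grid:

```python
import numpy as np

def funnorm_transform(y, r_grid, beta_hat, z):
    """Map one sample's values y to functionally normalized values.

    r_grid:   grid of points in [0, 1] where functions are evaluated
    beta_hat: (m, len(r_grid)) fitted coefficient functions on the grid
    z:        length-m covariate vector for this sample

    Steps: form the empirical quantile function of y on r_grid,
    subtract the covariate effect (Eq. 4), then push each value of y
    through (q_emp)^{-1} followed by q_funnorm (Eq. 5).
    """
    y = np.asarray(y, dtype=float)
    q_emp = np.quantile(y, r_grid)                 # empirical quantile function
    q_funnorm = q_emp - z @ beta_hat               # Eq. (4)
    r_of_y = np.interp(y, q_emp, r_grid)           # (q_emp)^{-1} applied to y
    return np.interp(r_of_y, r_grid, q_funnorm)    # Eq. (5)

# Sanity check: with zero covariate effect, the data come back unchanged.
r = np.linspace(0, 1, 501)
y = np.random.default_rng(1).lognormal(size=1000)
out = funnorm_transform(y, r, np.zeros((2, 501)), np.zeros(2))
print(np.allclose(out, y))  # -> True
```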

We now describe how to obtain the estimates $\widehat{\beta}_j$ for $j=1,\dots,m$. Model (1) is an example of function-on-scalar regression, described in [50]. The literature on function-on-scalar regression makes assumptions about the smoothness of the coefficient functions and uses a penalized framework, because the observations are typically noisy and non-smooth. In contrast, because our observations $\mathbf{Y}_i$ are high dimensional and continuous, the jumps of the empirical quantile functions are very small. This allows us to circumvent the smoothing approach used in traditional function-on-scalar regression. We use a dense grid of $H$ equidistant points between 0 and 1, and we assume that $H$ is much smaller than the dimension of $\mathbf{Y}_i$. On this grid, model (1) reduces pointwise to a standard linear model. Because the empirical quantile functions $q_i^{\text{emp}}(r)$ have very small jumps, the parameter estimates of these linear models vary little between two neighboring grid points. This allows us to use $H$ standard linear model fits to compute estimates $\widehat{\alpha}(h)$ and $\widehat{\beta}_j(h)$, $j=1,\dots,m$, with $h$ on the dense grid $\{d/H : d = 0,1,\dots,H\}$. We next form estimates $\widehat{\alpha}(r)$ and $\widehat{\beta}_j(r)$, $j=1,\dots,m$, for any $r\in[0,1]$ by linear interpolation. This is much faster than the penalized function-on-scalar regression available through the refund package [51].
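
The grid-based estimation step can be sketched as follows. This is an illustrative Python version (the package itself is written in R), simulating data in which each sample's distribution shifts with a single covariate:

```python
import numpy as np

def fit_quantile_model(Y, Z, H=500):
    """Pointwise OLS fits of model (1) on the grid {d/H : d = 0,...,H}.

    Y: (n, p) matrix of n samples by p probes
    Z: (n, m) covariate matrix
    Returns the grid r and alpha_hat(r), beta_hat(r) evaluated on the
    grid; values at any other r follow by linear interpolation.
    """
    n = Y.shape[0]
    r = np.arange(H + 1) / H
    Q = np.quantile(Y, r, axis=1).T                # (n, H+1) empirical quantile fns
    X = np.column_stack([np.ones(n), Z])           # intercept column estimates alpha
    coef, *_ = np.linalg.lstsq(X, Q, rcond=None)   # one linear model per grid point
    return r, coef[0], coef[1:]

# Simulated check: quantile functions shift by 2 * covariate, so the
# fitted coefficient function should be close to 2 everywhere:
rng = np.random.default_rng(0)
z = rng.normal(size=(40, 1))
Y = rng.normal(loc=2 * z, size=(40, 2000))
r, alpha_hat, beta_hat = fit_quantile_model(Y, z, H=100)
print(round(float(beta_hat.mean()), 1))  # -> 2.0
```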

Importantly, in this framework, using a saturated model in which all the variation (other than the mean) is explained by the covariates results in removing all variation and is equivalent to quantile normalization. In our notation, the quantile-normalized quantile functions are

$$q_i^{\text{quantile}}(r) = \widehat{\alpha}(r), \tag{6}$$

where $\widehat{\alpha}$ is the mean of the empirical quantile functions. This corresponds to the maximum variation that can be removed in our model. In contrast, including no covariates makes the model comparable to no normalization at all. By choosing covariates that only measure unwanted technical variation, functional normalization removes only the variation explained by these technical measurements and leaves biological variation intact. Functional normalization thus allows a sensible trade-off between removing no technical variation at all (no normalization) and removing too much variation, including global biological variation, as can occur in quantile normalization.

### Functional normalization for 450k arrays

We apply the functional normalization model to the methylated (*M*) and unmethylated (*U*) channels separately. Since we expect the relationship between the methylation values and the control probes to differ between type I and type II probes, functional normalization is also applied separately by probe type to obtain more representative quantile distributions. We address probes mapping to the sex chromosomes separately; see below. This results in four separate applications of functional normalization, using the exact same covariate matrix, with more than 100,000 probes in each normalization fit. For functional normalization, we pick *H*=500 equidistant points (see the notation in the previous section). As covariates, we use the first *m*=2 principal components of the summary control measures described above. We do this because the control probes are not designed to hybridize to genomic DNA and hence are not intended to measure a biological signal. Our choice of *m*=2 is based on empirical observations on several data sets.
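
The covariate construction can be sketched as follows, assuming the 42 summary control measures sit in a samples-by-measures matrix (a hypothetical Python version; minfi performs this internally in R):

```python
import numpy as np

def control_covariates(summaries, m=2):
    """First m principal components of the summary control measures.

    summaries: (n_samples, 42) matrix of summary control measures.
    Returns the (n_samples, m) covariate matrix Z for model (1),
    computed via SVD of the column-centered matrix.
    """
    X = summaries - summaries.mean(axis=0)
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :m] * s[:m]

# Twelve hypothetical samples, 42 summary measures each:
Z = control_covariates(np.random.default_rng(0).normal(size=(12, 42)))
print(Z.shape)  # -> (12, 2)
```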

Following the ideas from quantile normalization for 450k arrays [11],[12], we normalize probes mapping to the sex chromosomes (11,232 and 416 probes for the X and Y chromosomes, respectively) separately from the autosomal probes. For each of the two sex chromosomes, we normalize males and females separately. For the X chromosome, we use functional normalization; for the Y chromosome, we use quantile normalization, since the small number of probes on this chromosome violates the assumptions of functional normalization, which results in instability.

Functional normalization only removes variation in the marginal distributions of the two methylation channels associated with control probes. This preserves any biological global methylation difference between samples. We have found (see Results) that we get slightly better performance for functional normalization if we apply it to data that have been background corrected with noob [16].

### Data

**The Ontario study.** The Ontario study consists of samples from 2,200 individuals from the Ontario Familial Colon Cancer Registry [52] who had previously been genotyped in a case–control study of colorectal cancer in Ontario [53]. The majority of these samples are lymphocytes derived from whole blood. We use various subsets of this data set for different purposes.

Biospecimens and data collected from study participants were obtained with written informed consent and approval from the University of Toronto Office of Research Ethics (Protocol Reference 23212), in compliance with the WMA Declaration of Helsinki – Ethical Principles for Medical Research Involving Human Subjects.

**The Ontario-EBV data set.** Lymphocyte samples from 100 individuals from the Ontario study were transformed into immortal lymphoblastoid cell lines by Epstein–Barr virus (EBV) transformation. We divided the 100 EBV-transformed samples into two equal-sized data sets (discovery and validation). For the discovery data set, we matched the 50 EBV-transformed samples to 50 other lymphocyte samples assayed on the same plates. For the validation data set, we matched the 50 EBV-transformed samples to 50 other lymphocyte samples assayed on different plates.

**The Ontario-Blood data set.** From the Ontario study, we first created a discovery–validation design where we expect only a small number of loci to be differentially methylated. For the discovery data set, we selected all cases and controls on three plates that showed little evidence of plate effects among the control probes, which yielded a total of 52 cases and 231 controls. For the validation data set, we selected four plates where the control probes did show evidence of a plate effect and then selected cases and controls from separate plates, to maximize the confounding effect of plate. This yielded a total of 175 cases and 163 controls.

**The Ontario-Sex data set.** Among ten plates for which the control probes demonstrated differences in distribution depending on plate, we selected 101 males from a set of five plates and 105 females from another set of five plates, attempting to maximize the confounding effect of batch on sex.

**The Ontario-Replicates data set.** Among the lymphocyte samples from the Ontario study, 19 samples were assayed three times each: one replicate is a hybridization replicate and the other is a bisulfite conversion replicate. The 57 samples were assayed on 51 different slides across 11 different plates.

**The TCGA-KIRC data sets.** From TCGA, we have access to kidney renal clear cell carcinoma (KIRC) and normal samples, assayed on two different methylation platforms. We use the level 1 data, contained in IDAT files. For the 450k platform, TCGA has assayed 300 tumor samples and 160 normal samples. For the discovery set, we selected 65 tumor samples and 65 matched normal samples from slides showing little variation in the control probes. These 130 samples were assayed on three plates. For the validation data set, we selected the remaining 95 normal samples together with all 157 cancer samples that were part of the same TCGA batches as the 95 normal samples. These samples were spread over all nine plates, therefore maximizing potential batch effects. For the 27k platform, TCGA has assayed 219 tumor samples and 199 normal samples. There is no overlap between the individuals assayed on the 450k platform and the individuals assayed on the 27k platform.

**The TCGA-AML data sets.** Also from TCGA, we used data from 194 AML samples, each assayed twice: first on the 27k Illumina array and subsequently on the 450k array. All samples but two have been classified according to the FAB subtype classification scheme [40], which classifies tumors into one of eight subtypes; the two unclassified samples were removed post-normalization. We use the level 1 data, contained in IDAT files.

**Whole-genome bisulfite sequencing (WGBS) EBV data.** Hypomethylated blocks and small DMRs between transformed and quiescent cells were obtained from a previous study [38]. Only blocks and DMRs with a family-wise error rate equal to 0 were retained (see [38]). A total of 228,696 probes on the 450k array overlap with the blocks and DMRs.

### Data availability

The Ontario methylation data have been deposited in dbGaP under accession number [phs000779.v1.p1]. These data are available to researchers under the following constraints: (1) the use of the data is limited to research on cancer, (2) the researchers have local Institutional Review Board approval and (3) the researchers have the approval of either the Colon Cancer Family Registries [54] or the Mount Sinai Hospital (Toronto) Research Ethics Board. The TCGA data (KIRC and AML) are available through the TCGA Data Portal [55]. The WGBS EBV data are available through the Gene Expression Omnibus of the National Center for Biotechnology Information under accession number [GEO:GSE49629]. Our method is available as the preprocessFunnorm function in the minfi package through the Bioconductor project [56]. The code in this package is licensed under the open-source license Artistic-2.0.

### Data processing

Data were available in the form of IDAT files from the various experiments (see above). We used minfi [12] and illuminaio [49] to parse the data and used the various normalization routines in their reference implementations (see below).

We performed the following quality control on all data sets. As recommended in Touleimat and Tost [11], for each sample we computed the percentage of loci with a detection *P* value greater than 0.01, with the intention of excluding a sample if the percentage was higher than 10%. We used the minfi [12] implementation of the detection *P* value. We also used additional quality control measures [12] and we interactively examined the arrays using the shinyMethyl package [57]; all arrays in all data sets passed our quality control.

We performed the following filtering of loci after normalization. We removed 17,302 loci that contain a SNP with an annotated minor allele frequency greater than or equal to 1% in the CpG site itself or in the single-base extension site, using the UCSC Common SNPs table based on dbSNP 137 (included in the minfi package). We removed 29,233 loci that have been shown to cross-hybridize to multiple genomic locations [43]. The total number of loci removed is 46,535, i.e., 9.6% of the array. We chose to remove these loci post-normalization, as done previously [16],[58]: while these probes may lead to spurious associations, they are still subject to technical variation and should therefore contain information useful for normalization.

### Comparison to normalization methods

We have compared functional normalization to the most popular normalization methods used for the 450k array. This includes the following between-array normalization methods: (1) quantile: stratified quantile normalization as proposed by Touleimat *et al.* [11] and implemented in minfi [12], (2) dasen: background adjustment and between-sample quantile normalization of *M* and *U* separately [15] and (3) noob: a background adjustment model using the out-of-band control probes followed by a dye bias correction [16], implemented in the methylumi package. We also consider two within-array normalization methods: (4) SWAN [13] and (5) BMIQ [14]. Finally, we consider (6) raw data: no normalization, i.e., we only matched up the red and the green channels with the relevant probes according to the array design (specifically, it is the output of the preprocessRaw function in minfi).

In its current implementation, noob yielded missing values for at most a couple of thousand loci (less than 1%) per array. This is based on excluding loci below an array-specific detection limit. We have discarded those loci from our performance measures, but only for the noob performance measures. In its current implementation, BMIQ produced missing values for all type II probes in five samples for the TCGA-AML data set. We have excluded these samples for our performance measures, but only for our BMIQ performance measures.

For clarity, in the figures we focus on the top-performing methods which are raw data, and quantile and noob normalization. The assessments of the other methods, dasen, BMIQ and SWAN, are available in Additional file 1: Supplementary Materials.

### Comparison to SVA

We used the reference implementation of SVA in the sva package [45]. We applied SVA to the *M* values obtained from the raw data. Surrogate variables were estimated using the iteratively re-weighted SVA algorithm [26], and were estimated separately for the discovery and validation cohorts. In the analysis of the Ontario-EBV data set, SVA found 21 and 23 surrogate variables, respectively, for the discovery and the validation cohorts. In the analysis of the Ontario-Blood data set, SVA found 18 and 21 surrogate variables, respectively, for the discovery and the validation cohorts. In the analysis of the TCGA-KIRC data set, SVA found 29 and 32 surrogate variables, respectively, for the discovery and the validation cohorts. In the analysis of the TCGA-AML data set, SVA found 24 surrogate variables.

### Comparison to RUV

The RUV-2 method was originally developed for gene expression microarrays [28] and involves a number of domain-specific choices. To our knowledge, there is no publicly available adaptation of RUV-2 to the 450k platform, so we adapted RUV-2 to the 450k array ourselves. The core of the method is implemented in software available from a personal website [59]. As negative control genes (features not associated with the biological treatment group), we selected the raw intensities in the green and red channels of the 614 internal negative control probes available on the 450k array.

To determine the number *k* of factors to remove (see Gagnon-Bartsch and Speed [28] for details of this parameter), we followed the approach described in [28]. First, for each value *k*=0, 1, …, 40, we performed a differential analysis with respect to sex. Second, we considered as positive controls the probes known to undergo X-inactivation (see section Sex validation analysis) and the probes mapping to the Y chromosome. Third, for the top-ranked *m*=25,000, 50,000 and 100,000 probes, we counted how many of the positive control probes were present in the list. Finally, we picked the value of *k* that maximized these counts. The different tuning plots are presented in Additional file 1: Figure S9. The optimal *k* was 14 and 11 for the discovery and the validation cohorts of the Ontario-EBV data set, respectively. In the analysis of the Ontario-Blood data set, the optimal *k* was 0 and 3, respectively, for the discovery and the validation cohorts. In the analysis of the TCGA-KIRC data set, the optimal *k* was 36 and 5, respectively, for the discovery and the validation cohorts. In the analysis of the TCGA-AML data set, *k* was selected to be 0 (which is equivalent to raw data).

### Comparison to ComBat

We used the reference implementation of ComBat in the sva package [45]. Because ComBat cannot be applied to data sets for which the phenotype of interest is perfectly confounded with the batch variable, we could only run ComBat for the AML and KIRC data sets.

### Identification of differentially methylated positions

To identify DMPs, we used *F*-statistics from a linear model of the beta values from the array. The linear model was applied on a probe-by-probe basis. In most cases, the model included case/control status as a factor. In the 27k data, we adjusted for batch by including a plate indicator (given by TCGA) in the model.

### Discovery–validation comparisons

To measure the consistency of each normalization method at finding true DMPs, we compared results obtained on a discovery–validation split of a large data set. Comparing results between two different subsets of a large data set is an established idea and has been applied to the context of 450k normalization [14],[46]. We extended this basic idea in a novel way by introducing an *in silico* confounding of treatment (case/control status) by batch effects as follows. In a first step, we selected a set of samples to be the discovery cohort, choosing samples for which the treatment variable is not visibly confounded by plate effects. For the validation cohort, we then selected samples with strong potential for confounding of treatment by batch, for example by choosing cases and controls from different plates (see the data descriptions). The extent to which it is possible to introduce such a confounding depends on the data set. In contrast to earlier work [46], we normalized the discovery and the validation cohorts separately, to mimic an independent replication experiment more realistically. The idea of creating *in silico* confounding between batch and treatment has been previously explored in the context of genomic prediction [39].

We quantified the agreement between validation and discovery in two ways: by an ROC curve and a concordance curve. For the ROC curve, we used the discovery cohort as the gold standard. Because the validation cohort is affected by a batch effect, a normalization method that is robust to batch effects will show better performance on the ROC curve. Making this ROC curve required us to choose a set of DMPs for the discovery cohort. The advantage of the ROC curve is that the plot displays immediately interpretable quantities, such as specificity and sensitivity.

For the concordance curve, we compared the top *k* DMPs from the discovery and the validation sets, and displayed the percentage of the overlap for each *k*. These curves do not require us to select a set of DMPs for the discovery cohort. Note that these curves have been previously used in the context of multiple-laboratory comparison of microarray data [60].
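
A concordance curve reduces to counting top-*k* overlaps. A small self-contained sketch with made-up probe rankings:

```python
def concordance_curve(ranked_discovery, ranked_validation, ks):
    """Percentage overlap between the top-k items of two ranked lists.

    ranked_discovery / ranked_validation: probe identifiers ordered by
    evidence of differential methylation. Returns one percentage per k.
    """
    return [
        100.0 * len(set(ranked_discovery[:k]) & set(ranked_validation[:k])) / k
        for k in ks
    ]

# Toy example with six hypothetical probes:
print(concordance_curve(list("abcdef"), list("acbdef"), [2, 4, 6]))
# -> [50.0, 100.0, 100.0]
```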

### Sex validation analysis

On the 450k array, 11,232 and 416 probes map to the X and Y chromosomes, respectively. Because some genes have been shown to escape X inactivation [44], we considered only probes for which the X-inactivation status is known, to ensure an unbiased sex prediction. According to [44], 1,678 probes undergo X-inactivation, 140 probes escape X-inactivation, and 9,414 probes have either variable or unknown status.

For the ROC curves, we defined the true positives to be the 1,678 probes undergoing X-inactivation and the probes mapping to the Y chromosome (416 probes); by removing the probes that have been shown to cross-hybridize [43], we were left with 1,877 probes. For the true negatives, we considered the 140 probes escaping X-inactivation and the autosomal probes that do not cross-hybridize. The rest of the probes were removed from the analysis.

### Sample size simulation

To assess the performance of functional normalization for different small sample sizes, we devised the following simulation scheme for the Ontario-EBV data set. First, we kept the discovery data set intact to ensure a reasonable gold standard in the discovery–validation ROC curves; we only simulated different sample sizes for the validation subset. For sample sizes *n* = 10, 20, 30, 50 and 80, we randomly chose half of the samples from the EBV-transformed samples, and the other half from the lymphocyte samples. For instance, for *n* = 10 samples, we randomly picked five samples from each of the treatment groups. We repeated this subsampling *B* = 100 times, which generated 100 discovery–validation ROC curves for each *n*. For a fixed *n*, we considered the mean of the *B* = 100 ROC curves as well as the 0.025 and 0.975 quantiles to mimic a 95% confidence interval.
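
The summary of the *B* = 100 resampled ROC curves can be sketched as follows, assuming each curve has been evaluated on a common grid of false-positive rates:

```python
import numpy as np

def summarize_roc_curves(tpr_matrix, lo=0.025, hi=0.975):
    """Pointwise mean and (lo, hi) quantiles of B ROC curves.

    tpr_matrix: (B, n_grid) true-positive rates, one row per
    subsampled ROC curve on a shared FPR grid. The quantiles mimic a
    95% confidence band, as described above.
    """
    tpr_matrix = np.asarray(tpr_matrix, dtype=float)
    mean = tpr_matrix.mean(axis=0)
    band = np.quantile(tpr_matrix, [lo, hi], axis=0)
    return mean, band

# Two toy curves on a two-point FPR grid:
mean, band = summarize_roc_curves([[0.2, 0.8], [0.4, 1.0]])
print(float(mean[1]))  # -> 0.9
```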

### Reproducibility

A machine-readable document detailing our analyses is available at GitHub [61].