quantro: a data-driven approach to guide the choice of an appropriate normalization method

Normalization is an essential step in the analysis of high-throughput data. Multi-sample global normalization methods, such as quantile normalization, have been successfully used to remove technical variation. However, these methods rely on the assumption that observed global changes across samples are due to unwanted technical variability. Applying global normalization methods has the potential to remove biologically driven variation. Currently, it is up to the subject matter experts to determine if the stated assumptions are appropriate. Here, we propose a data-driven alternative. We demonstrate the utility of our method (quantro) through examples and simulations. A software implementation is available from http://www.bioconductor.org/packages/release/bioc/html/quantro.html. Electronic supplementary material The online version of this article (doi:10.1186/s13059-015-0679-0) contains supplementary material, which is available to authorized users.

1 quantro: An R-package to test for global differences in distributions across groups The quantro R-package can be used to test for global differences in distributions between groups to guide the choice of whether it is appropriate to use global normalization methods, such as quantile normalization. Our method uses the raw unprocessed high-throughput data and computes a test statistic to compare the variability of distributions within each group to the variability of distributions between the groups. If the variability between the groups is sufficiently larger than the variability within each group, then there may be global differences between the groups of distributions suggesting quantile normalization may not appropriate, depending on the type and source of variation.
The main function quantro() will perform two tests: 1. An ANOVA to test if the medians of the distributions are different across groups. Differences across groups could be attributed to unwanted technical variation (such as batch effects) or real global biological variation. This is a helpful step for the user to verify if there is any technical variation unaccounted for.

A test for global differences between the distributions across groups which returns a test statistic called
quantroStat. This test statistic is a ratio of two variances (similar to the idea of ANOVA): the variability of the distributions within groups relative to the variability between groups. If the variability between groups is sufficiently larger than the variability within groups, then this suggests global adjustment methods may not be appropriate. As a default, we perform this test on the median normalized data, but the user may change this option.

Deriving the test statistic F quantro
Assume we have a set of n T samples representing K groups (e.g. K = 2 if case/control comparison). Within the k th group, assume there are n k samples. We let each sample have N observations. Assume the distribution representing the i th sample in the k th group (F ik ) has some common distribution (F k ). Using the raw observed data, we apply a median normalization to the set of samples by removing each sample median. Let the median normalized data be given by X where (X ik ) j represents the j th observation (row) from the i th individual in the k th group (column).
We assume (X ik ) j ∼ F ik where E[(X ik ) j ] = µ k and V ar[(X ik ) j ] = σ 2 and define F −1 ik (u) as the quantile function (or inverse distribution function) of X ik where u ∈ [0, 1]. Consider whereF −1 ·k (u) represents the u th quantile averaged across samples in the k th group andF −1 ·· (u) represents the u th quantile averaged across all samples and groups.
Under the null hypothesis of no global differences in distributions between groups, we formally define our null and alternative hypotheses in terms for the quantile functions for each of K groups: To quantify the variability within and between groups of distributions, we define the "total variance" as the sum of squared differences (SS total ) between F −1 ik andF −1 ·· using Mallow's distance. Mallow's distance [1] (also more generally known as Earth mover's distance or the Wasserstein metric) is defined as the distance between two probability distributions (say F , G ∈ R) over a region where F −1 and G −1 represent the quantile functions corresponding to the distributions F and G. Here, we show the "total variance" (or sum of squares) of all the distributions (SS total ) can be written as the sum of the "variance between groups" (SS between ) and the "variance within groups" (SS within ): = SS between + SS within (7) where the cross product terms cancel out because We propose using a data-driven test statistic, referred to as F quantro , to test for global differences in distributions across groups. This test statistic is a ratio of the variability in the distributions between groups (M S between ) to the to the variability in the distributions within groups (M S within ): F quantro = M S between M S within = SS between /(K − 1) SS within /(n T − K) (8) 1.2 Assessing the statistical significance of F quantro To assess statistical significance, we use permutation testing. We permute the variable defining the group level information B times and re-calculate the test statistic for each permuted sampled (F b quantro ). The p-value from the permutation test is calculated as Using some α significance level, if p ≥ α, then we fail to reject H 0 . If p < α, then we would reject H 0 .
2 Description of high-throughput data used To investigate global differences in distributions between groups of samples from high-throughput data sets, we considered several publicly available gene expression and DNA methylation data sets. Table 1 contains a list of the data sets used for this analysis. We considered several experimental design scenarios such as comparing samples across tissues, populations, cell types, and outcomes such as 'normal' and 'tumor' samples. We used a significance level of α = 0.05, but refer the reader to the simulation studies in Section 4 for a more detailed discussion on other choices for α. pickrellRNASeq. We considered a study originally performed to identify expression quantitative trait loci (eQTLs) using RNA-Seq data [2]. For this analysis, we considered one of the eQTLs identified in the YRI population (rs7639979) which is discussed in [2]. This eQTL stratifies the n = 69 samples in the YRI population by the genotypes GG (n = 18), GA (n = 32), AA (n = 15) and NN (n = 4). We removed the four samples with the missing NN genotype for this analysis. The densities and box plots of the rlogTransformation counts are shown in Figure 1 and colored by genotype: GG (blue), GA (green) and AA (red). We tested for global differences in the distributions across the groups stratified by genotype using quantro. We assessed the statistical significance of the test statistic (F quantro = 0.376) using permutation testing and report there were no global differences detected at the α = 0.05 level between the distributions of the groups (p = 0.917).  The plots use n = 65 samples from the groups based on the eQTL identified in the YRI population (rs7639979) and colored by by genotype: GG (blue), GA (green) and AA (red). Using quantro, we report no global differences detected at the α = 0.05 level between the distributions using the eQTL status (p = 0.917).
mouseStrainsRNASeq. We considered a study originally performed to compare the gene expression of two inbred mouse strains using RNA-Seq data [4]. For this analysis, we considered n = 21 samples from the two inbred mouse strains: B6 strain (n = 10) and D2 strain (n = 11). The densities and box plots of the rlogTransformation counts are shown in Figure 2 and colored by mouse strain: B6 (green) and D2 (red). We tested for global differences in the distributions between the mouse strains using quantro. We assessed the statistical significance of the test statistic (F quantro = 1.215) using permutation testing and report there were no global differences detected at the α = 0.05 level between the distributions of mouse strains (p = 0.245).  The plots use n = 21 samples colored by by mouse strain: B6 strain (green) and D2 strain (red). Using quantro, we report no global differences detected at the α = 0.05 level between the two mouse strains (p = 0.245).

Microarrays
We used the affy R/Bioconductor package [14] to analyze Affymetrix GeneChip arrays. We extract the raw Perfect Match (PM) probes from the original CEL files.
alveolarSmokingAffyData. We examined 45 Affymetrix Gene Chip arrays (GEO accession GSE2125) [5] which compared the gene expression of alveolar macrophages of n = 45 from 15 smokers, 15 nonsmokers, and 15 subjects with asthma (disease control). The densities and box plots of the raw PM values are shown in Figure 3 and colored by disease status: Nonsmoking (green), Smoking (red), Asthma (blue). We tested for global differences in the distributions between the three groups using quantro. We assessed the statistical significance of the test statistic (F quantro = 0.601) using permutation testing and report there were no global differences detected at the α = 0.05 level between the groups (p = 0.562).  lungCOPDAffyData. We examined 238 Affymetrix Human Gene 1.0 ST arrays (GEO accession GSE37147) [6] which compared the gene expression of bronchial brushings from n = 87 samples with chronic obstructive pulmonary disease (COPD) and n = 151 samples without COPD. The densities and box plots of the raw PM values are shown in Figure 4 and colored by disease status: No COPD (green), COPD (red). We tested for global differences in the distributions between the two groups using quantro. We assessed the statistical significance of the test statistic (F quantro = 1.45) using permutation testing and report there were no global differences detected at the α = 0.05 level between the groups (p = 0.218).  brainParkinsonsAffyData. We examined 22 Affymetrix arrays (GEO accession GSE19587) [7] which compared gene expression from two regions in the brain using control patients (n = 10) and patients diagnosed with Parkinson's disease (n = 12). The two regions of the brain compared were Dorsal Motor Nucleus of the Vagus (DMNV) and Inferior Olivary Nucleus (ION). The densities and box plots of the raw PM values are shown in Figure 5 and colored by region in the brain: DMNV (green) and ION (red). We tested for global differences in the distributions between the two regions of the brain in each set of samples using quantro. We assessed the statistical significance of the test statistic (F quantro = 3.097 (only control samples), F quantro = 1.441 (only Parkinson's samples)) using permutation testing. We report there were no global differences detected at the α = 0.05 level between the distributions in both the control samples (p = 0.119) and in the Parkinson's samples (p = 0.264). We note there is a large amount of variation within each group (both in the controls and in the Parkinson's samples). Therefore, quantro has a limited amount of power to detect any global differences because the variation is so high within each group and the sample size is small for each group.   [7]. The samples are colored by region in the brain: DMNV (green) and ION (red). Using quantro, we report there are no global differences detected at the α = 0.05 level between the distributions of the DMNV and ION samples in both the control samples (p = 0.119) and in the Parkinson's samples (p = 0.264).
These samples were from multiple GEO data sets with the largest number of normal brain samples (GSE17612, GSE21935) and normal liver samples (GSE14668, GSE29721, GSE38941), respectively. We investigated if there were global differences in the distributions between the two tissues. The densities and box plots of the raw PM values are shown in Figure 6 and colored by tissue: brain (green) and liver (red). The shades of green and red represent the different GEO data sets (i.e. different batches) within each tissue. We tested for global differences in the distributions between the tissues using quantro. We assessed the statistical significance of the test statistic (F quantro = 7.373) using permutation testing and report there are global differences between the distributions of the brain and liver tissues (p = 0.004).    We tested for global differences in the distributions between the groups using quantro. We assessed the statistical significance of the test statistic (F quantro = 308.8) using permutation testing and report there are global differences between the distributions of the normal and tumor breast samples (p < 0.001). prostateCancerAffyData. We extracted 167 Affymetrix prostate samples representing normal (n = 85) and tumor (n = 82) samples. These samples were from GEO data sets with the largest number of normal prostate samples (GSE17951, GSE32448) and tumor prostate samples (GSE2109). We investigated if there were global differences in the distributions between the normal and tumor samples. The densities and box plots of the raw PM values are shown in Figure 9 and colored by tumor status: normal (green) and tumor (red). The shades of green and red represent the different GEO data sets (i.e. different batches) within each tumor status. We tested for global differences in the distributions between the groups using quantro. We assessed the statistical significance of the test statistic (F quantro = 16.95) using permutation testing and report there are global differences between the distributions of the normal and tumor prostate samples (p < 0.001).  thyroidCancerAffyData. We extracted 98 Affymetrix thyroid samples representing normal (n = 65) and tumor (n = 33) samples. These samples were from GEO data sets with the largest number of normal thyroid samples (GSE29265, GSE33630) and tumor thyroid samples (GSE2109). We investigated if there were global differences in the distributions between the normal and tumor samples. The densities and box plots of the raw PM values are shown in Figure 10 and colored by tumor status: normal (green) and tumor (red). The shades of green and red represent the different GEO data sets (i.e. different batches) within each tumor status. We tested for global differences in the distributions between the groups using quantro. We assessed the statistical significance of the test statistic (F quantro = 19.94) using permutation testing report there are global differences between the distributions of the normal and tumor thyroid samples (p < 0.001).  stomachCancerAffyData. We investigated 82 Affymetrix stomach samples representing normal (n = 31) and tumor (n = 51) samples. These samples were from GEO data sets with the largest number of normal stomach samples (GSE13911) and tumor stomach samples (GSE13911, GSE2109). We investigated if there were global differences in the distributions between the normal and tumor samples. The densities and box plots of the raw PM values are shown in Figure 11 and colored by tumor status: normal (green) and tumor (red). The shades of green and red represent the different GEO data sets (i.e. different batches) within each tumor status. We tested for global differences in the distributions between the groups using quantro. We assessed the statistical significance of the test statistic (F quantro = 11.81) using permutation testing and report there are global differences between the distributions of the normal and tumor stomach samples (p < 0.001). Because there were both normal and tumor stomach samples within the same GEO data set (GSE13911), we tested for global differences using only these n = 69 samples in Figure 11. Again, we report global differences (F quantro = 11.46) between the normal and tumor stomach samples (p < 0.001).   Figure 12 and colored by tumor status: normal (green) and tumor (red). The shades of green and red represent the different GEO data sets (i.e. different batches) within each tumor status. We tested for global differences in the distributions between the groups using quantro. We assessed the statistical significance of the test statistic (F quantro = 3.502) using permutation testing and report there are global differences detected at the α = 0.05 level between the distributions of the normal and tumor liver samples (p = 0.044).  liverNAFLDAffyData. We examined 73 Affymetrix arrays (GEO accession GSE48452) [8] which compared the gene expression of liver tissues grouped into n = 14 control, n = 27 healthy obese, n = 14 steatosis, and n = 18 nash samples. The densities and box plots of the raw PM values are shown in Figure 13 and colored by disease status: Control (orange), Healthy obese (red), Steatosis (green), Nash (blue). We tested for global differences in the distributions between the two groups using quantro. We assessed the statistical significance of the test statistic (F quantro = 4.286) using permutation testing and report there are global differences between the distributions of the NAFLD samples (p = 0.004).

Microarrays
We used the minfi R/Bioconductor package [15] to analyze Illumina Infinium 450K arrays. We extracted the raw methylated and unmethylated signal using the preprocessRaw() function the cellcompMethyl data set. The raw methylated and unmethylated counts were provided as text files for the pancreaticT2DMethyl and adiposeEx-erciseMethyl data sets (http://www.ludc.med.lu.se/research-units/epigenetics-and-diabetes/published-data). The "beta"-values were computed from the methylated and unmethylated counts using the getBeta() function using Illumina's formula with an offset = 100.
adiposeExerciseMethyl. We examined the DNA methylation of 46 adipose tissue samples comparing men before and after six months of exercise [10]. This study was performed to find differentially methylated CpGs between healthy men before (n = 23) and after (n = 23) six months of exercise on the Illumina Infinium HumanMethylation450 BeadChip array. For this analysis, we used the raw beta values to test for global differences in the distributions between the two groups: before exercise, after exercise. The densities and box plots of the raw beta values are shown in Figure 14 and colored by disease status: before 6 months of exercise (green) and after 6 months of exercise (red). We tested for global differences in the distributions between the groups using quantro. We assessed the statistical significance of the test statistic (F quantro = 2.092) using permutation testing and report there were no global differences detected at the α = 0.05 level between the distributions of healthy men before and after six months of exercise (p = 0.132). BeadChip arrays in the adiposeExerciseMethyl data from [10]. The samples are colored by disease status: before (green) and after six months of exercise (red). Using quantro, we report there are no global differences detected at the α = 0.05 level between the distributions using the raw beta values of healthy men before and after six months of exercise (p = 0.132).
pancreaticT2DMethyl. We examined the DNA methylation of 49 pancreatic tissue samples from non-diabetic and type 2 diabetes (T2D) patients [11]. This study was performed to find differentially methylated CpGs in T2D patients compared to donors not diagnosed with diabetes using the Illumina Infinium HumanMethylation450 BeadChip array. The authors analyzed n = 15 T2D samples and n = 34 non-diabetic samples. For this analysis, we used the raw beta values to test for global differences in the distributions between the two groups: non-diabetic and T2D. The densities and box plots of the raw beta values are shown in Figure 15 and colored by disease status: non-diabetic (green) and T2D (red). We tested for global differences in the distributions between the groups using quantro. We assessed the statistical significance of the test statistic (F quantro = 2.854) using permutation testing and report there were no global differences detected at the α = 0.05 level between the distributions of the non-diabetic and T2D samples (p = 0.069). BeadChip arrays in the pancreaticT2DMethyl data from [11]. The samples are colored by disease status: nondiabetic (green) and T2D (red). Using quantro, we report there were no global differences detected at the α = 0.05 level between the distributions of the non-diabetic and T2D samples (p = 0.069).
cellcompMethyl. We examined the DNA methylation of 36 samples of purified cell types from whole blood [12]. This study was originally performed to determine if whole blood is a valid source for DNA methylation analysis using the  Figure 16 and colored by cell type. We tested for global differences in the distributions between the groups using quantro. We assessed the statistical significance of the test statistic (F quantro = 6.797) using permutation testing and report there are global differences between the distributions of purified cell types (p < 0.001). BeadChip arrays in the cellcompMethyl data across n = 6 purified cell types [12]. The samples are colored by cell type. Using quantro, we report there are global differences between the distributions across the six cell types using the raw beta values (p < 0.001).
3 quantroSim: An R-package to simulate gene expression and DNA methylation data To evaluate the performance of quantro and quantile normalization, we developed an R-package, referred to as quantroSim, to simulate gene expression and DNA methylation data. The models to simulate gene expression (and DNA methylation data) start by defining true biological differences (unobserved) between a set of groups that is not based on using any platform technology. Next, we define a set of platform-specific parameters that create observed technical variability and simulate a set of samples from each group. Our model controls the proportion of true differences between groups and controls of the magnitude of the technical variation from the platform technology.
To simulate gene expression and DNA methylation samples using a microarray platform technology, we use the Langmuir adsorption model [16] to model the background noise of non-specific binding, the optical noise from the florescence intensity of the scanner and the chemical saturation in the hybridization of the probes.

Simulating gene expression samples
To simulate the real gene expression in a given sample, we start with gene-level information. If at least one copy of a given RNA molecule exists in the sample, then that molecule transcribed from a given gene is being expressed.
Let n g = the number of RNA molecules expressed from g th gene where g ∈ (1, . . . , G). We assume n g follows a zero-inflated Poisson distribution: where π is the parameter representing the inflated proportion of zeros and log 2 (λ g ) ∼ N (µ g , σ 2 g ).
To compare the gene expression between K = 2 groups, we define n gk as the number of RNA molecules expressed from the g th gene in the k th group. We define pDiff as the proportion of genes that have increased gene expression between the first group (k = 1) and the second group (k = 2). Define D as the set genes that are different between the first group and the second group. We define n gk in the following way: where γ g is a fold change (e.g. γ g = 5).

Microarrays
In Affymetrix arrays each gene is represented by 11-20 probes pairs: perfect matches (PM) and mismatches (MM).
For our purposes, we only simulate PM probes. To simulate probe-level fragments, let N g = the number of probes for the g th gene N g ∼ Binomial(20, 0.6) Each RNA molecule from the g th gene in the k th group is sheared into fragments and z T gk = (z 1gk , . . . , z Ng,gk ) are the number fragments in the sample represented by the N g probes. If there are n gk RNA molecules, then z jgk ≤ n gk . Let p jg be the probability of successfully creating a matching fragment for the j th probe from the n gk RNA molecules. We assume p jg ∼ U nif (0, 1), then we can simulate z jgk ∼ Binomial(n gk , p jg ) PCR amplification is sometimes used to amplify the RNA. This is not a requirement for the simulations, but the option is available. To simulate PCR, we use probability generating functions [17]. Let q jg = probability that the RNA fragment in the j th probe in g th gene replicates itself during one PCR cycle. If there are z jgk RNA molecules before PCR, then the expected number of RNA molecules after N P CR PCR cycles is where q jg ∼ U nif (0, 1). If PCR is not used in the simulation then, x ijk = z ijk . To keep the notation simple, we drop the g th gene notation and consider just x jk . To model the saturation probe effect in microarrays, we use the Langmuir adsorption model [16]. The "intensity" of the PM probes can be modeled using where i ∈ (1, . . . , n k ) represents the sample, j ∈ (1, . . . , J) represents the probe and k ∈ (1, . . . , K) represents the group index. Let T = total number of samples.
We define the optical noise (o ijk ) to be written as a product of sample-level optical noise (o ik ) and probe-level optical noise (o j ) or o ijk = o ik * o j to allow for global shifts between samples. Both optical noise parameters are simulated using a lognormal normal distribution. Similarly, the florescence intensity from the scanner (a ijk = a ik * a j ) and the parameter (b ijk = b ik * b j ) are also a product of sample-level parameter and probe-level parameter to allow for global scaling between samples. The background noise (d ijk ), measurement error ( ijk ) are also simulated using a lognormal distribution. Table 2 contains a list of the example parameters used to simulate gene expression samples using microarrays. Each of the parameters sample-level noise (a ik ) and probe-level noise parameters (a j ) have their own hyperparameters to allow for global shifts. Similarly, b ijk = b ik * b j and o ijk = o ik * o j (optical noise) with their own set of hyperparameters.
Using the multivariate normal distribution, we simulate log 2 (a), log 2 (b) and log 2 (o) with a given set of hyperparameters (see Table 2). In the quantroSim R-package, the variance hyperparameters are referred to as siga, sigb and sigOpt in the simulateGEx() function. In quantroSim, the level of technical variation induced from the platform technology is controlled using the covariance matricies of siga, sigb and sigOpt in the function simulateGEx(). There is a vignette available in the quantroSim R-package that contains more details about the simulateGEx() function. Here we define "low" technical variation as the default parameters in the functions simulateGEx() (Table 2) and "high" technical variation as ten fold increase. An example of simulated gene expression arrays with "low" and "high" technical variation and default parameters otherwise is given in Figure 17 and 18, respectively.
To compare the DNA methylation between k = 2 groups, we define θ jk as the true proportion of methylation at the j th CpG site in the k th group. We define pDiff as the proportion of CpG sites that are different between the first group (k = 1) and the second group (k = 2) and pUp as the proportion of pDiff CpGs that change from an Now, define D as the set CpG sites that are different between the first group and the second group. We simulate z jk in the following way: where ∼ N (0, 0.01) and Then, we can compute θ jk as The expected number of methylated and unmethylated molecules at jth CpG site is x M jk = θ jk N and x U jk = (1−θ jk )N where N is a scaling factor (e.g. N = 10 6 ).

Microarrays
To model saturation probe effect, we use the Langmuir adsorption model [16]. The "intensity" of methylated and unmethylated probes observed can be modeled using where i ∈ (1, . . . , n k ) represents the sample, j ∈ (1, . . . , J) represents the CpG site (or probe) and k ∈ (1, . . . , K) represents the group index. Let T = total number of samples.
To calculate the "beta"-values which are values between 0 and 1 (where 1 is highly methylated), use where offset is a value to prevent dividing by 0 (e.g. offset = 100 which is the default from Illumina). Then we can compare the β ijk 's to the true proportion of methylation θ jk for the j th probe in the k th group.
We define the optical noise (o ijk ) to be written as a product of sample-level optical noise (o ik ) and probe-level optical noise (o j ) or o ijk = o ik * o j to allow for global shifts between samples. Both optical noise parameters are simulated using a lognormal normal distribution. Similarly, the florescence intensity from the scanner (a ijk = a ik * a j ) and the parameter (b ijk = b ik * b j ) are also a product of sample-level parameter and probe-level parameter to allow for global scaling between samples. The background noise (d ijk ), measurement error ( ijk ) are simulated using a lognormal distribution. Table 3 contains a list of the example parameters used to simulate DNA methylation samples using microarrays. Each of the parameters sample-level noise (a ik ) and probe-level noise parameters (a j ) have their own hyperparameters to allow for global shifts.
with their own set of hyperparameters.
Using the multivariate normal distribution, we simulate log 2 (a), log 2 (b) and log 2 (o) with a given set of hyperparameters (see Table 3). In the quantroSim R-package, the variance hyperparameters are referred to as siga, sigb and sigOpt in the simulateMeth() function. In quantroSim, the level of technical variation induced from the platform technology is controlled using the covariance matricies of siga, sigb and sigOpt in the function simulateMeth(). There is a vignette available in the quantroSim R-package that contains more details about the simulateMeth() function. Here we define "low" technical variation as the default parameters in the functions simulateMeth() (Table 3) and "high" technical variation as ten fold increase for siga and sigb and a two fold increase for sigOpt. An example of simulated DNA methylation arrays with "low" and "high" technical variation and default parameters otherwise is given in Figure 19 and 20, respectively.

Simulation Study: Assessing the Performance of quantro
In the previous section, we discussed the quantroSim R-package that we developed to simulate gene expression and DNA methylation data. Here, we evaluate the performance of global normalization methods in the context of targeted and global changes in distributions with the goal of detecting differentially methylated CpGs. We performed a Monte Carlo simulation study to illustrate how the use of global normalization methods, such as quantile normalization, is not always appropriate. For the simulation study, we simulate DNA methylation arrays with a goal of detecting differentially methylated CpGs, but note these results also translate for differential gene expression. We compare naively using quantile normalization to using quantro to guide the decision of using either quantile normalization or no normalization to assess the cost of using global normalization methods in the context of distributions with global differences. We performed several simulation studies to evaluate the bias, mean squared error (MSE), false discovery rate (FDR), true positive rate (TPR) and false positive rate (FPR) (see Table 4 for a description of the performance metrics).
For the following simulation studies, we simulate DNA methylation arrays with a goal of detecting differentially methylated CpGs, but note these results also translate for differential gene expression. We consider two groups with five samples (total of 10 samples). For each set of 10 simulated samples, we control pDiff (the proportion of CpGs differentially methylated between the two groups). Once a set of 10 DNA methylation samples are simulated, we process the raw "beta"-values using both quantile normalization (using the function normalize.quantiles() in the preprocessCore [18] R/Bioconductor package) and quantro. The function quantro() in the quantro R/Bioconductor package uses the F quantro test statistic and the significance level α in a permutation test to assess the statistical significance: Finally, we estimate the difference in group means between the two groups and find the top differently methylated probes using a t-test (specifically rowttests() in the genefilter [19] R/Bioconductor package for efficiency). The plots containing the results from the simulation studies were created using the ggplot2 R package [20].

Performance metrics
We considered five performance metrics to assess the performance of processing the raw data with quantile normalization compared to using quantro to decide if quantile normalization is appropriate (Table 4). In Section 3.2, we defined θ jk (10) as the true "beta"-values at the j th probe and k th group. We also defined the simulated "beta"-values β ijk for the i th sample, j th probe and k th group (11). Here, we computeβ .j1 andβ .j2 which are the "beta"-values averaged across the samples within each group at the j th probe to estimate the bias and MSE when detecting differentially methylated CpGs after using quantro or just quantile normalization.
The false discovery rate (FDR), true positive rate (TPR) and false positive rate (FPR) are computed using the curve is used to depict the relative trade-offs between TPR and FPR. We used partial area under the curve (pAUC) [21] to compare the ROC curves. Bias

Bias-Variance trade-off
In this first simulation study, we illustrate the bias-variance tradeoff of using normalization methods with and without global adjustments in the context of distributions with and without global differences. We assessed the relative bias (bias from quantro to the bias from quantile normalization) and relative mean squared error (MSE) while varying the threshold α from quantro (Section 4.2.1) and for a fixed cutoff threshold (Section 4.2.2).
We repeat the following procedure N = 1000 times: 1. Define the proportion of differentially methylated CpGs (pDiff) as 0 if no differences between groups or randomly sample pDiff from a uniform distribution with parameters a and b: U (a, b). Using the simulateMethTruth() and simulateMeth() functions in the quantroSim R-package, simulate 10 DNA methylation samples each with 10,000 CpGs where pDiff defines the proportion of the CpGs that are differentially methylated two groups (5 samples in each group). We consider two levels of technical variability "low" and "high" which are discussed in Figures 19 and 20. Default parameters are used unless stated otherwise.
2. Normalize the 10 samples using both quantro and quantile normalization where quantro uses the F quantro test statistic (8) to decide if quantile normalization is appropriate with a significance level of α (no normalization otherwise).
3. Compute the bias and MSE when using quantro and quantile normalization.
We average the bias and MSE across the N = 1000 simulations.

Bias-Variance trade-off as a function of the cutoff used by quantro
To determine which threshold should be used when assessing the statistical significance of F quantro , we compared the relative bias and relative MSE while varying the cutoff threshold used by quantro. We also considered different ranges for pDiff. If pDiff = 0, then there no differentially methylated CpGs between the two groups. As pDiff increases, the proportion of differentially methylated CpGs increases. Figure 21 gives examples of the relative bias and relative MSE under four ranges of pDiff: When there are no differences between the groups, quantile normalization reduces the bias and MSE in detecting true differences between groups of samples as it removes unwanted technical variation. As the number of differentially methylated CpGs increases, quantile normalization will remove both the unwanted technical and interesting biological variation resulting in higher bias and MSE when detecting differential methylation. In contrast, quantro reduces the bias and MSE compared to using quantile normalization because the method is able to detect when there are global differences.  Here we focus on using one significance level from quantro. Using a significance level of α = 0.05 (black line in Supplemental Figure 21), we compare the relative bias and relative MSE of quantro to quantile normalization. In Figure 22, we show that when there are global changes in distributions between the two groups, quantro reduces the bias and MSE compared to using quantile normalization because the method is able to detect when there are global differences.

Number of false discoveries
In the second simulation study, we estimate the number of false discoveries out of the top differentially methylated CpGs.
We repeat the following procedure N = 100 times:

Receiver operating characteristic curves
Here we illustrate the trade-off between the true positive rate (TPR) and the false positive rate (FPR) using an ROC curve. We measure the performance using partial area under the curve (pAUC).
We repeat the following procedure N = 1000 times:  A B

Application-specific normalization methods
If global adjustment methods are not appropriate, other methods such as application-specific methods [22] can be used. These are normalization methods where the adjustments are directly incorporated into the experiment or main analysis. Examples of these methods include the use of positive and negative control genes [23,24,22], the use of spike-in controls [25,26,27,28,29,9,30], and explicitly modeling known or unknown effects of unwanted variation in a linear model [31,32,33,34,35,36]. Previous studies have evaluated and discussed normalization methods with and without global adjustments [37,18,38,22,39], but the decision of which type of normalization method to use depends on the outcome of interest.

Impact of experimental normalization in the context of global changes in gene expression
A recent study [9] discussed the use of normalization procedures in global gene expression analysis comparing two schematics: targeted changes in gene expression and global changes in gene expression such as transcriptional amplification [40,41] or transcriptional shutdown [42]. The authors performed a gene expression experiment using  Table 1.
We investigated if there were global differences in the distributions between the low and high c-Myc samples using the raw PM values from the original CEL files for this analysis. An AffyBatch object is available as R data package at https://github.com/stephaniehicks/mycAffyData. To visualize the true biological variation in the experimentally normalized samples, we divided the raw PM values by the sample mean of the PM values across the spike-ins on the log scale. Figure 25 plots the densities of the PM values before experimental normalization and the PM values from the CEL files (after experimental normalization). We tested for global differences in the distributions between the two groups using quantro using the PM values from the original CEL files (experimentally normalized samples). We assessed the statistical significance of the test statistic (F quantro = 1.548) using permutation testing and report there were no global differences detected at the α = 0.05 level between the distributions of the groups (p = 0.318).
Not surprisingly, the authors show global normalization methods are not appropriate if the total RNA is not the same across the samples. In this case, if normalization is performed at the experimental level (introducing similar amounts of RNA into the assay from the two groups with global changes), then we suggest using control genes or spike-ins controls as no differences between the distributions will be detected ( Figure 25). However, for the great majority of studies such strategies are not available. Furthermore, if one knows a priori that most genes are differentially expressed then it is not clear why one would use these high-throughput technologies.

Biological variation:
Real global di erences from transcriptional ampli cation In this case, if normalization is performed at the experimental level by introducing similar amounts of RNA from the two groups with global differences into the assay, then we suggest to use control genes or spike-ins controls as no global differences between the distributions will be detected.