Normalization of two-channel microarrays accounting for experimental design and intensity-dependent relationships

eCADS is a new method for multiple array normalization of two-channel microarrays that takes into account general experimental designs and intensity-dependent relationships and allows for a more efficient dye-swap design that requires only one array per sample pair.

The new dye functions $d^*(x^*_{il}) + \delta^*_k(x^*_{il})$ are also monotone increasing, and preserving this monotonicity is more important than estimating the true dye functions. These definitions also preserve the constraints $\sum_{k=1}^{2} \delta^*_k(x) = 0$ and $\sum_{j=1}^{n} a^*_j(x) = 0$ for any argument $x$. We can therefore translate the model in equation (2) of the main document into model (a), in which the second equality follows from the definition of $d^*$. We proceed with a two-stage approach: (i) estimate the $x^*_{il}$, and (ii) fit model (a) using the estimates $\hat{x}^*_{il}$ as plug-in values for the $x^*_{il}$.

Estimating the warped RNA amounts
In general, to estimate the $x^*_{il}$, we use fitted values from gene-specific ANOVA or regression models. One simple choice, model (b), is based on the ANOVA model in equation (1) of the main document. The spot effects $a_{ij}$ are not necessary for our purposes here and are excluded from the model.
Additional covariates or sources of bias could be added to the gene-specific model as necessary.
Recall that $x^*_{il} = d(x_{il})$ in the eCADS model corresponds to $\mu_i + t_{il}$ in the ANOVA model. Hence, we estimate the $x^*_{il}$ with the fitted values $\hat{\mu}_i + \hat{t}_{il}$ from model (b). The estimated "warped RNA amounts" can thus be thought of as group-specific means, adjusted gene-by-gene for bias.
We note that in our previous work on the CADS method (Dabney & Storey 2007), we defined all intensity-dependent functions in terms of subject-specific RNA amounts, $x_{ijl}$. Since CADS targets technical dye-swap experiments, where comparisons are made between paired arrays, it is natural to think of its arguments as being subject-specific. In the present setting, we could also use subject-specific RNA amounts, although this requires some further assumptions. Importantly, however, the choice between population- or subject-specific RNA amounts does not affect the targets of interest, $d(x_{il})$, and hence does not affect the operating characteristics stated here for eCADS. For details, see Dabney (2006).
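As a rough illustration of this first stage, the sketch below fits a small gene-specific linear model by least squares and returns the group-level fitted values. It is only a sketch under stated assumptions: the function name `warped_rna_estimates`, the +1/-1 dye coding, and the choice to adjust for a single dye term (with no array term) are hypothetical simplifications, not the exact form of model (b).

```python
import numpy as np

def warped_rna_estimates(y, group, dye):
    """Per-gene least-squares fit of y = (mu + t_group) + dye effect.

    y     : (m, H) array of log intensities for m genes over H channels
    group : length-H integer array of group labels 0..p-1
    dye   : length-H array of +1 / -1 (hypothetical coding of the two dyes)
    Returns an (m, p) array of fitted values mu_hat + t_hat, one per group.
    """
    y = np.asarray(y, dtype=float)
    group = np.asarray(group)
    dye = np.asarray(dye, dtype=float)
    p = group.max() + 1
    # Cell-means coding: one indicator column per group, so each group
    # coefficient directly estimates mu + t_l.
    X_group = (group[:, None] == np.arange(p)[None, :]).astype(float)
    # A single +1 / -1 dye column stands in for the gene-specific bias term.
    X = np.column_stack([X_group, dye])
    # Each column of y.T is one gene's response, so all genes are fit at once.
    coef, *_ = np.linalg.lstsq(X, y.T, rcond=None)
    return coef[:p].T

# One gene, two groups, dye-swap layout over four channels.
est = warped_rna_estimates([[5.5, 6.5, 4.5, 7.5]],
                           group=[0, 1, 0, 1],
                           dye=[1, -1, -1, 1])
```

In this toy layout the dye column is orthogonal to the group columns, so the group coefficients reduce to group means with the dye effect removed.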

Basis matrix representation of the fANOVA model
Suppose we can write the component functions of (a) in terms of basis matrices. Here, for example, the basis component for the dye functions at the $i$th gene, evaluated at RNA amounts specific to group $l$, is a $q_\delta$-dimensional row vector. We can similarly define $\delta^*_k(x^*_l)$ to be the vector-valued function with $i$th component equal to $\delta^*_k(x^*_{il})$, and so on. This allows us to write the model in vector form, where, for example, the basis for the dye functions is now an $m \times q_\delta$ matrix.
Now group the $y_{ijkl}$ into single-channel chunks $y_h$, $h = 1, 2, \ldots, 2n$. Let $Z$ be the model matrix describing the experiment, as defined in the section "Parameterizing the model". Specifically, let $Z$ be the $2n \times (n + p + 2)$ matrix with $h$th row equal to $z_h = [z^g_{1h} \cdots z^g_{ph} \;\; z^d_{1h} \;\; z^d_{2h} \;\; z^a_{1h} \cdots z^a_{nh}]$. Here, $z^g_{lh}$ is a scalar indicator of whether channel $h$ comes from group $l$ (1 for yes, 0 for no), $l = 1, 2, \ldots, p$. Similarly, $z^d_{1h}$ and $z^d_{2h}$ indicate whether channel $h$ was labeled with the red or green dye, respectively, and $z^a_{jh}$ indicates whether the $h$th channel profile comes from array $j$, $j = 1, 2, \ldots, n$.
The channel-specific matrices $G_h$ and $H_h$ are built from products of the basis matrices with these indicators, with terms of the form $B^a_l z^g_{lh} z^a_{jh}$, for $k = 1, 2$, $j = 1, 2, \ldots, n$, and $h = 1, 2, \ldots, 2n$. We can then rewrite (a) in stacked form, where the $G$ components are the corresponding stacked versions of the $G_h$ defining $H_h$. We similarly define $G_x$ as the stacked version of the $G_{xh}$. This gives the stacked model.
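The indicator structure of $Z$ can be made concrete with a short sketch. The function name `model_matrix`, the zero-based labels, and the coding 0 = red, 1 = green are assumptions made for illustration; only the column layout (group, dye, array indicators) and the $2n \times (n + p + 2)$ shape are taken from the text.

```python
import numpy as np

def model_matrix(group, dye, array, p, n):
    """Indicator model matrix Z with one row per channel.

    group : length-2n labels in 0..p-1
    dye   : length-2n labels in {0, 1} (0 = red, 1 = green; hypothetical coding)
    array : length-2n labels in 0..n-1
    Columns: p group indicators, then 2 dye indicators, then n array indicators.
    """
    group, dye, array = map(np.asarray, (group, dye, array))
    if not (len(group) == len(dye) == len(array) == 2 * n):
        raise ValueError("expected one label per channel, 2n channels in total")
    Zg = (group[:, None] == np.arange(p)).astype(int)   # group indicators
    Zd = (dye[:, None] == np.arange(2)).astype(int)     # dye indicators
    Za = (array[:, None] == np.arange(n)).astype(int)   # array indicators
    return np.hstack([Zg, Zd, Za])

# Two-array dye swap comparing two groups: 2n = 4 channels, n + p + 2 = 6 columns.
Z = model_matrix(group=[0, 1, 1, 0], dye=[0, 1, 0, 1], array=[0, 0, 1, 1], p=2, n=2)
```

Each row contains exactly three ones, since every channel belongs to one group, one dye, and one array.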

Least-squares solutions
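Our estimation goal is to minimize the least-squares criterion with respect to $\theta$, subject to the constraints $\sum_{k=1}^{2} \beta^\delta_k = 0$ and $\sum_{j=1}^{n} \beta^a_j = 0$. This can be done by replacing $\beta^\delta_2$ with $-\beta^\delta_1$ and $\beta^a_n$ with $-\sum_{j=1}^{n-1} \beta^a_j$ in equation (c), which leaves an unconstrained problem in the remaining coefficients. Standard least-squares theory then gives the estimate of $\theta$.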
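The substitution step can be sketched numerically. The code below is a minimal illustration, assuming generic design blocks `D` (dye) and `A` (array) in place of the basis-matrix products of the model; the function name `constrained_lstsq` and the simulated data are hypothetical. It eliminates the last coefficient block of each constrained family, solves the resulting unconstrained problem with `numpy.linalg.lstsq`, and then recovers the eliminated blocks.

```python
import numpy as np

def constrained_lstsq(y, D, A):
    """Least squares for y ~ sum_k D[k] @ b_d[k] + sum_j A[j] @ b_a[j],
    subject to sum_k b_d[k] = 0 and sum_j b_a[j] = 0.

    D : list of 2 design blocks, each of shape (N, q_d)
    A : list of n design blocks, each of shape (N, q_a)
    """
    n = len(A)
    q_d, q_a = D[0].shape[1], A[0].shape[1]
    # Substitute b_d[1] = -b_d[0] and b_a[n-1] = -(b_a[0] + ... + b_a[n-2]).
    D_red = D[0] - D[1]
    A_red = [A[j] - A[n - 1] for j in range(n - 1)]
    X = np.hstack([D_red] + A_red)
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Recover the eliminated coefficient blocks from the constraints.
    b_d0 = theta[:q_d]
    b_a_free = theta[q_d:].reshape(n - 1, q_a)
    b_d = np.vstack([b_d0, -b_d0])
    b_a = np.vstack([b_a_free, -b_a_free.sum(axis=0)])
    return b_d, b_a

# Noise-free toy data whose true coefficients satisfy both constraints.
rng = np.random.default_rng(0)
N, n, q_d, q_a = 40, 3, 2, 2
D = [rng.normal(size=(N, q_d)) for _ in range(2)]
A = [rng.normal(size=(N, q_a)) for _ in range(n)]
true_d = np.array([[1.0, -0.5], [-1.0, 0.5]])
true_a = np.array([[0.2, 0.1], [0.3, -0.4], [-0.5, 0.3]])
y = sum(Dk @ b for Dk, b in zip(D, true_d)) + sum(Aj @ b for Aj, b in zip(A, true_a))
b_d, b_a = constrained_lstsq(y, D, A)
```

Eliminating one block per family keeps the reduced design full rank whenever the original constrained problem has a unique solution, which is why ordinary least squares can be applied afterwards.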

eCADS preserves differential expression relationships
eCADS can be applied to any valid experimental design. For optimal results, however, the design should be balanced with respect to comparison group. When this balance holds, the gene-specific sample averages of the eCADS-normalized data equal the quantities of interest, $d(x_{il})$, in expectation. Thus, since $d(x_{il})$ is a monotone function of the true RNA amounts, the expected value of any test statistic based on sample averages that compares the expression level of a gene in different groups has the same sign as the true difference in RNA amount. This means that, in expectation, null genes will be called null, overexpressed genes called overexpressed, and underexpressed genes called underexpressed. Based on simulation work, minor imbalances do not affect this result. A detailed proof follows.
Let $Z$ be the model matrix describing the experiment. Denote the $h$th row of $Z$ by $z_h = [z_{gh} \;\; z_{dh} \;\; z_{ah} \;\; z_{ch} \;\; z_{bh}]$, $h = 1, 2, \ldots, 2n$. Here, $z_{gh}$, $z_{dh}$, and $z_{ah}$ collect the group, dye, and array indicators for channel $h$. The component $z_{ch}$ represents factors of interest, such as confounders, while the component $z_{bh}$ represents biases additional to the dyes and arrays used. To be general, suppose there are $C$ additional factors of interest, where the $r$th factor of interest has $p_{C_r}$ levels, $r = 1, 2, \ldots, C$. We associate with the $u$th level of the $r$th factor of interest the function $c_{ru}$ and assume that $\sum_{u=1}^{p_{C_r}} c_{ru} = 0$, $r = 1, 2, \ldots, C$. Similarly, suppose there are $B$ additional bias terms, where the $s$th bias factor has $p_{B_s}$ levels, $s = 1, 2, \ldots, B$. We associate with the $v$th level of the $s$th additional bias factor the function $b_{sv}$ and assume that $\sum_{v=1}^{p_{B_s}} b_{sv} = 0$, $s = 1, 2, \ldots, B$. Let $\tilde{y}_{il}$ be the sample average of the eCADS-normalized data for gene $i$ in comparison group $l$.
Since we are assuming that test statistics will be based on sample averages, and since the factors of interest are balanced with respect to comparison group, it is sufficient to consider the test statistic $t_i = \tilde{y}_{il} - \tilde{y}_{il'}$ for comparing the expression level of gene $i$ in groups $l$ and $l'$. Let $n^{\delta}_{kl}$ be the number of times group $l$ is labeled with dye $k$, and let $n^{a}_{jl}$ be an indicator of whether group $l$ is on array $j$. Similarly, let $n^{c}_{rul}$ be the number of times group $l$ takes level $u$ of the $r$th additional factor of interest. Finally, let $n^{b}_{svl}$ be the number of times group $l$ takes level $v$ of the $s$th additional bias factor. We can write $\tilde{y}_{il}$ in terms of these counts.
If there is balance with respect to comparison group, then $n^{\delta}_{1l} = n^{\delta}_{2l}$, $n^{c}_{11l} = \cdots = n^{c}_{1 p_{C_1} l}$, $\ldots$, $n^{c}_{C1l} = \cdots = n^{c}_{C p_{C_C} l}$, $n^{b}_{11l} = \cdots = n^{b}_{1 p_{B_1} l}$, $\ldots$, and $n^{b}_{B1l} = \cdots = n^{b}_{B p_{B_B} l}$. If direct comparisons between two groups were made, then each group appears on every array, and $n^{a}_{1l} = \cdots = n^{a}_{nl}$. For indirect comparisons or experiments with more than two groups, this will not be the case. However, in these cases, we can further assume that array functions sum to zero within each relevant subset of arrays. In practice, this assumption is not necessary, since array effects are expected to be minor in comparison to dye effects; any residual array effect will not change the monotonicity of $d$.
Based on the sum-to-zero constraints on the model terms, we therefore have $\tilde{y}_{il} = x^*_{il} + \bar{\epsilon}_{il} = d(x_{il}) + \bar{\epsilon}_{il}$. A similar result holds for group $l'$, so that $E(t_i) = d(x_{il}) - d(x_{il'})$. The average dye function $d$ is monotone increasing, and so $E(t_i)$ has the same sign as the true difference $x_{il} - x_{il'}$.
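The cancellation behind this argument can be checked on a noise-free toy example. The sketch below does not run eCADS; it only illustrates how sum-to-zero dye effects average out when each group is labeled equally often with each dye. The particular functions `d` and `delta`, and the helper `group_mean_difference`, are hypothetical choices made for illustration.

```python
import numpy as np

def d(x):
    # Hypothetical monotone-increasing average dye function.
    return np.log1p(np.exp(x))

def delta(x, k):
    # Hypothetical intensity-dependent dye effects with delta_1 + delta_2 = 0.
    sign = 1.0 if k == 0 else -1.0
    return sign * 0.3 * np.tanh(x)

def group_mean_difference(x_l, x_lp, design):
    """Noise-free difference of group sample averages.

    design : list of (group, dye) pairs, one per channel; group 0 has true
             RNA amount x_l and group 1 has true RNA amount x_lp.
    """
    x = (x_l, x_lp)
    vals = {0: [], 1: []}
    for g, k in design:
        vals[g].append(d(x[g]) + delta(x[g], k))
    return np.mean(vals[0]) - np.mean(vals[1])

balanced = [(0, 0), (1, 1), (0, 1), (1, 0)]     # each group sees both dyes
unbalanced = [(0, 0), (1, 1), (0, 0), (1, 1)]   # group and dye are confounded

diff_bal = group_mean_difference(2.0, 1.0, balanced)
diff_unbal = group_mean_difference(2.0, 1.0, unbalanced)
diff_null = group_mean_difference(1.5, 1.5, balanced)
```

Under the balanced layout the difference reduces to `d(2.0) - d(1.0)`, which shares its sign with the true difference in RNA amounts, and a null gene gives a difference of zero. Under the confounded layout the dye effects do not cancel and remain in the difference.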