Model formulation
A two-channel microarray experiment produces data representing true expression levels that have been translated and distorted by various technological processes. In particular, we do not observe the true total (or even relative) counts of RNA molecules. We observe some function of these quantities in the form of a fluorescence intensity, where the function we observe incorporates various technical aspects of the experiment, including systematic biases. A minimal requirement is that microarray data preserve relative expression relationships. In particular, we want to be able to detect the presence and direction of true differential expression. This can be formalized by the requirement that the sign of relative relationships be preserved.
Our proposed statistical model for the observed expression values from a microarray experiment can be motivated by considering the early ANOVA models proposed for microarrays [8, 13]. Let y
ijkl
be the observed log expression measurement, a single-channel fluorescence intensity, for gene i on array j labeled with dye k in comparison group l. We begin with the model:
y
ijkl
= μ
i
+ d
k
+ t
il
+ a
ij
+ ε
ijkl
,
i = 1,2,...,m, j = 1,2,...,n, k = 1,2, l = 1,2. For simplicity, let k = 1 indicate red dye (k = 2 green). This is the ANOVA model of [8], with some of the components collapsed together. Thus, the μ
i
represent gene-specific baseline means, d
k
the dye effects, t
il
the group effects (a group main effect together with its gene × group interaction), a
il
'spot' effects (an array main effect together with its gene × array interaction), and the ε
ijkl
random error.
The spot effects a
ij
are allowed to change for every gene-array combination. However, there is evidence that such effects are functionally related within each array [2]. As such, fitting every point individually will result in an overfit. On the other hand, the dye effects d
k
are only allowed two values, one for each dye. There is also evidence for intensity-dependent dye effects [2–4]. As such, restricting dye changes to constant shifts will result in an underfit. A more flexible dye term could be used instead in equation 1. However, together with the overfit array terms, this would quickly use up all available degrees of freedom for model fitting.
A natural way to incorporate intensity-dependent relationships is to replace the factor terms in equation 1 with functions of the underlying RNA amounts, creating the fANOVA model:
y
ijkl
= d(x
il
) + δ
k
(x
il
) + a
j
(x
il
) + ε
ijkl
,
where x
il
is the average RNA amount for gene i in group l. In terms of the model given in equation 1, μ
i
+ t
il
is now represented by d(x
il
), d
k
by δ
k
(x
il
), and a
ij
by a
j
(x
il
). General intensity-dependent biases due to differences between the dyes can now be flexibly modeled with the δ
k
(x
il
). Similarly, array-specific biases that are intensity-dependent can be modeled using the a
j
(x
il
), without the use of separate factor terms for every gene-array combination. To make model of equation 2 identifiable, we require that
and
for any argument x.
Note that 'RNA amount' x
il
is an abstract quantity meant only to help conceptualize the problem. Ideally, 'RNA amount' would mean the average number of RNA molecules available in group l for hybridization to the spot that characterizes gene i. Practically, we require only an 'RNA-equivalent amount', a bias-free monotone function of RNA count. As will be detailed below, we actually estimate a translated version of the model in equation 2 that is in terms of RNA-equivalent arguments that can be estimated from the data. Having said that, for convenience, we refer to the x
il
as 'RNA amount' throughout the paper.
While the form of the model in equation 2 is simple enough, a major problem is immediately apparent: we do not know the RNA amounts x
il
. If we did know the x
il
, we would preprocess the single-channel data in such a way that dye and array biases are removed, but relationships between the comparison groups are not affected. In terms of the model in equation 2, this can be accomplished by subtracting off the terms δ
k
(x
il
)+a
j
(x
il
) (leaving us with the quantities of interest, d(x
il
)). This would remove bias from both the individual observations and the average differences between comparison groups.
Note that MA-smoothing methods make assumptions about the form of the x
il
. For example, MA-smoothing methods assume that the gene-specific group differences x
il
-x
il'
are symmetric about zero and that these differences do not change systematically with gene-specific group averages (x
il
+x
il'
)/2 [11]. Since the x
il
are unknown, and no information about them is available without performing a dye-swap, the MA-smoothing assumptions are unverifiable in practice. Our estimation procedure does not assume any particular structure for the x
il
. The x
il
are merely arguments for the functions to be estimated. Our method thus applies regardless of whether there is any asymmetry, intensity-dependent differential expression, unequal variability between groups, more than 50% differential expression, and so on.
The efficient Common Array Dye Swap
In what can generally be called a 'dye swap', dye configuration is swapped in some arrays relative to others. That is, on some arrays, the red dye is used for group 1 and the green dye for group 2, while for other arrays this configuration is reversed. Typically, a 'technical dye-swap' is carried out, where the swapping occurs on technical replicate arrays. The general model in equation 2 is valid as long as there are arrays under both dye configurations. In particular, no technical replication is necessary, obviating the need for a technical dye-swap design. One simple implementation has half the arrays under one dye configuration (group 1 labeled red and group 2 green, say, in a direct-comparison experiment) and the other half under the other configuration. We define an 'efficient dye-swap' as swapping dye configuration on arrays that represent biological, rather than technical, replicates.
In [11], we proposed the CADS method for normalization when technical dye-swaps are used in direct-comparison experiments (with both comparison groups on the same array). The CADS model also uses functions of RNA amount to represent intensity-dependent biases. However, CADS is restricted to direct comparisons with technical dye-swaps and does not easily incorporate additional covariates or sources of bias. Since the present method is targeted at efficient dye-swap designs, we refer to it as eCADS; our use of the term 'efficient' is intended as a distinction between 'efficient'and 'technical' dye-swaps, not as a claim of efficiency of an estimator as used in statistics. eCADS is still applicable to direct-comparisons and technical replicates and, thus, can be seen as a generalization of CADS. As we discuss below, the eCADS model is estimated in a two-stage process. Under a simple form of balanced design, this estimation procedure preserves differential expression relationships in expectation.
eCADS can be applied to any valid experimental design using two-channel microarrays. Examples include: direct comparison experiments in which two groups are compared on the same array; indirect comparisons using a reference design (the dye used for the reference group must be swapped; and experiments in which there are more than two comparison groups. Additional covariates or sources of bias are also naturally incorporated. Furthermore, any feature of the model can be represented as either an intensity-dependent function or a traditional factor term.
We have shown in previous work that a technical dye-swap is necessary for removing dye bias in general from a single pair of samples [11]. With eCADS, we provide a more general result. Namely, dye bias can be removed from a set of sample pairs with dye-swaps on biological replicates. Technical dye-swaps are frequently avoided due to the inconvenience of making technical replicate arrays. MA methods are often used instead, at least partly due to their requiring only one array per sample pair. However, we have shown that MA methods require strong assumptions that cannot be validated in practice. eCADS handles the same intensity-dependent biases that are targeted by MA methods and also requires only one array per sample pair. In addition, eCADS does not require the restrictive assumptions of MA methods.
Estimation
Unknown functional arguments
It is usually the case in the regression setting that one wants to characterize the association between the dependent and independent variables. In our setting, fluorescence intensity is the dependent variable, and RNA amount is the independent variable; recall comments about our use of 'RNA amount' in the section 'Model formulation'. However, for purposes of normalization, our goal is to estimate the model terms representing bias and remove them. As it turns out, we are able to estimate and remove bias terms from the eCADS model without having to directly observe or estimate the independent variable. This is done by estimating a translated version of the independent variable, borrowing strength across arrays, and redefining the model to be in terms of the translated version.
The left panel of Figure 1 illustrates a hypothetical realization of the model in equation 2. On the x-axis is RNA amount, and each of the component functions of the model in equation 2 for a single array are labeled. The fluorescence intensities y
R
(red) and y
G
(green) are the sum of the average dye function d, the dye-specific functions δ
R
or δ
G
, and the array-specific function a. The fluorescence intensity for a particular gene labeled red, say, is the evaluation of y
R
at that gene's RNA amount, plus random error.
While we cannot estimate the actual RNA amounts for each gene, we can translate the model in equation 2 so that it is in terms of a quantity that we can estimate. Specifically, we redefine the model in equation 2 to be defined in terms of d(x
il
) instead of x
il
. As illustrated in the right panel of Figure 1, this transformation of the model warps the x-axis in the left panel of Figure 1, leaving the fluorescence intensities unchanged. That is, the curves from the left panel have been reshaped by changing the positions of their arguments on the x-axis. Since we have reshaped around d(x
il
), the new average dye function is the line of equality. Also, we have not affected the qualitative characteristics of the dye- and array-specific functions. Having translated the model, we proceed with the two-stage estimation process: first, estimate the warped RNA amounts; and second, estimate the translated version of the model in equation 2, plugging in the warped RNA estimates.
In the simplest case, with group l labeled with both dyes on an equal number of arrays, we can estimate d(x
il
) by simply averaging all observations for gene i in group l; the dye and array effects cancel out. More generally, we estimate the d(x
il
) by fitting separate regression models to each gene (see the appendix in Additional data file 1 for details). As a result, the 'warped RNA amounts' can be thought of as simple group means, adjusted gene-by-gene for bias. Thus, in general, eCADS begins with the fitted values from gene-specific regressions, analogous to the ANOVA approach. However, these gene-specific estimates are smoothed across genes by plugging them into the functions of the eCADS model.
Parameterizing the model
Regression or ANOVA models are typically represented by model matrices. The same principles apply in the microarray setting [25]. An efficient dye-swap design can be characterized by a standard design matrix Z with rows for each array channel and columns for the comparison groups, dyes, and arrays. Specifically, with n arrays and p comparison groups, define Z to be the 2n × (n + p + 2) matrix with hth row equal to:
h = 1,2,...,2n. The component indicates whether the hth channel is from comparison group l, indicates whether the kth dye was used for labeling the hth channel, and indicates whether channel h is from array j. As an example, an experiment making direct comparisons between two groups using two arrays would have:
We express the functions of the model of equation 2 in terms of basis matrices. By combining these basis matrices with the information in the model matrix Z, we can write the eCADS model as a typical regression model in matrix form. Estimation is carried out by minimizing a sum-of-squares criterion subject to identifiability constraints on the regression parameters (details are given in the appendix in Additional data file 1). We emphasize that the eCADS model can be applied to any 'valid' (derived from an experimental design that allows estimation of the parameters of interest) model matrix Z. This includes direct comparisons, indirect comparisons (reference designs), and multiple comparison groups.
The fANOVA model and its basis matrix representation allow for flexible choices of the form of its component functions, as well as for the inclusion of additional sources of bias. The dye functions are defined as being monotone, and this could be explicitly incorporated into the model, as discussed in chapter 6 of [26]. Time-dependent biases, as can arise when forming arrays over periods of weeks or months, could be incorporated by adding functions of time. Spatial biases could be incorporated through the inclusion of a two-dimensional smoother, as in [27]. In the analyses that follow, we use natural cubic spline bases [28] with five degrees of freedom.
Operating characteristics
eCADS preserves differential expression relationships
We now consider making inference on data normalized with eCADS. Because fluorescence intensities are nonlinear warped versions of RNA amount even with perfect normalization, it is not possible to preserve relative fold-change amounts in terms of direct RNA counts. It is possible, however, to preserve the sign and/or existence of differential expression. Specifically, after eCADS normalization, the expected value of a test statistic for a particular gene that compares two groups should equal the sign of the true difference in RNA amount. This means that, in expectation, null genes are called null, overexpressed genes are called overexpressed, and underexpressed genes are called underexpressed. Since differential expression methods are almost always based on sample averages [29], we assume that the test statistics are based on sample averages.
It can be shown (see the appendix in Additional data file 1) that eCADS normalization preserves the sign of differential expression in expectation under a simple kind of balance in experimental design that we refer to as 'balance with respect to comparison group'. An efficient dye-swap design is 'balanced with respect to comparison group' if each experimental level for any one factor is repeated the same number of times for each comparison group. For example, a direct comparison of two groups with n arrays that follows the model in equation 2 is balanced with respect to comparison group if n/2 arrays have one dye configuration, and n/2 have the other. If indirect comparisons are made using reference designs, dye configuration must still be swapped. Suppose, for example, that there are three comparison groups, with n1 arrays for group 1, n2 arrays for group 2, n3 and arrays for group 3, with the same reference sample used for all arrays. Balance with respect to comparison group requires that n1/2 of the arrays corresponding to group 1 have the group 1 target labeled red, while the other n1/2 have reference labeled red, for example. The arrays in the other groups must similarly be broken into equal groups by dye configuration.
To further illustrate the generality of the eCADS model, consider the following example. In making direct comparisons between two groups, suppose that there is bias due to dye, array, and array batch, with B batches. Suppose also that we would like to adjust for gender when comparing the groups. We thus consider the model:
y
ijklrs
= d(x
il
) + δ
k
(x
il
) + a
j
(x
il
) + c
r
(x
il
) + b
s
(x
il
) + ε
ijklrs
,
where c
r
corresponds to the rth gender, r = 1,2, and b
S
corresponds to the sth array batch, s = 1,2,...,B. As with the other model terms, for any argument x. Normalization by eCADS would remove the biases due to dye, array, and batch. Inference would then be carried out by a weighted average of group differences within each gender, as with a gene-specific linear regression. Balance with respect to comparison group here requires: n/2 arrays in each dye configuration; n/B arrays in each array batch; and n/2 males in group 1, n/2 females in group 1, n/2 males in group 2, and n/2 females in group 2. Note that it does not matter how these factors are arranged with respect to one another.
The question remains of the penalty incurred when perfect balance with respect to comparison group is not possible due, for example, to an odd number of arrays. In general, using plug-in estimates for RNA amounts x is analogous to regressing on a covariate that has been measured with error. Outside of very simple settings, it is not possible to fully characterize the bias that results from such measurement error problems. However, as seen in the simulations above, there does not appear to be much of a penalty for having minor imbalances.