# Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application

- Cheng Li
^{1}and - Wing Hung Wong
^{1, 2}Email author

**2**:research0032.1

https://doi.org/10.1186/gb-2001-2-8-research0032

© Li and Wong, licensee BioMed Central Ltd 2001

**Received: **15 February 2001

**Accepted: **13 June 2001

**Published: **3 August 2001

## Abstract

### Background

A model-based analysis of oligonucleotide expression arrays we developed previously uses a probe-sensitivity index to capture the response characteristic of a specific probe pair and calculates model-based expression indexes (MBEI). MBEI has standard error attached to it as a measure of accuracy. Here we investigate the stability of the probe-sensitivity index across different tissue types, the reproducibility of results in replicate experiments, and the use of MBEI in perfect match (PM)-only arrays.

### Results

Probe-sensitivity indexes are stable across tissue types. The target gene's presence in many arrays of an array set allows the probe-sensitivity index to be estimated accurately. We extended the model to obtain expression values for PM-only arrays, and found that the 20-probe PM-only model is comparable to the 10-probe PM/MM difference model, in terms of the expression correlations with the original 20-probe PM/MM difference model. MBEI method is able to extend the reliable detection limit of expression to a lower mRNA concentration. The standard errors of MBEI can be used to construct confidence intervals of fold changes, and the lower confidence bound of fold change is a better ranking statistic for filtering genes. We can assign reliability indexes for genes in a specific cluster of interest in hierarchical clustering by resampling clustering trees. A software dChip implementing many of these analysis methods is made available.

### Conclusions

The model-based approach reduces the variability of low expression estimates, and provides a natural method of calculating expression values for PM-only arrays. The standard errors attached to expression values can be used to assess the reliability of downstream analysis.

## Background

The statistical model proposed in [1] for one probe set in multiple oligonucleotide arrays has the form

It states that the perfect match (PM)/mismatch (MM) difference in array *i,* probe *j* of this probe set is the product of model-based expression index (MBEI) in array *i* (θ_{
i
}) and probe-sensitivity index of probe *j* (
_{
j
}) plus random error. Here *J* is the number of probe pairs in the probe set. Fitting the model, we can identify cross-hybridizing probes (
_{
j
} with large standard error (SE), which are excluded during iterative fitting) and arrays with image contamination at this probe set (θ_{
i
} with large SE), as well as single outliers (image spikes) which are replaced by the fitted values. In effect the estimated expression index θ_{
i
} is a weighted average of PM/MM differences:

## Results and discussion

### Probe-sensitivity indexes are stable across tissue types

*0*) for array set 5.

In Figure 1a,b the target gene is present in most samples of all array sets. For a probe set whose target gene is mostly absent throughout samples (Figure 1c), many probes are identified as probe-outliers because of their negative indexes. Here, we cannot obtain correct probe-sensitivity indexes because of the absence of the target gene. Nevertheless, the PM-MM values for these probes are random fluctuations around zero, leading to a correct expression index close to zero. If the target gene becomes available for a future array set, the correct probe-sensitivity indexes will be recovered and these probes will be used for expression calculation.

Figure 2 shows the boxplots of average pairwise correlations of values between two array sets, stratified by average lower presence proportion in the two sets. In general, when a gene is present in many samples of two array sets, the patterns estimated from the two sets are very similar. The target gene's presence in many arrays of an array set allows the probe-sensitivity index to be estimated accurately.

### Model-based analysis for PM-only arrays

From Figure 1 of [1], one can see that some MM probes may respond poorly to the changes in the expression level of the target gene. This phenomenon raised questions on the efficiency of using MM probes, and led some investigators to design custom arrays that use PM probes exclusively (R. Abagyan and Yingyao Zhou, personal communication; B.R. Conklin, personal communication), and others to calculate fold changes using only PM probes (F. Naef, personal communication). This design greatly increases the number of genes that can be studied on one array. To investigate the relative performance of PM/MM versus PM-only designs, we exploited the model to estimate gene expression levels using only PM probes, and compared it to the MBEI using both PM and MM probes.

The full intensity model (equation 1 of [1]) specifies the relationship of PM probe responses and expression level θ:

where *v*_{
j
} is the baseline response of probe pair *j* due to nonspecific hybridization, and
'_{
j
} is the sensitivity of PM probe of the probe pair *j*. The parameter estimates can be obtained by iteratively fitting θ_{
i
} and *v*_{
j
},
'_{
j
}, regarding the other set as known. The same outlier exclusion procedure in [1] is applied. The MM probe responses have a similar form as equation 2 except for different probe-sensitivity indexes. We fit a PM-only and an MM-only model to obtain expression values of all 20-probe probe sets using array set 1. For comparison, we also used half of the probe pairs (by alternatively picking one out of every two probes) in a 20-probe probe set to fit to the difference model (equation 1). For each probe set, these three sets of expression values were compared with the expression values of the original difference model using 20 probes, in terms of correlation of θs obtained by two methods across the 21 arrays. We assumed the 20-probe difference model provides the most accurate expression estimates. If, for a probe set, a simplified model (PM-only, MM-only or 10-probe difference model) performs reasonably well, we expect its θ estimates to correlate with that from the 20-probe difference model.

This comparison corroborates the basic notion of the technology: the PM probes hybridize more strongly to the target signals than MM probes and contain most of the information. We stress that, whereas the above analysis illustrates the applicability of model-based analysis to PM-only arrays, the assessment presented here is only tentative because of the limited information provided by the HU6800 arrays on the comparisons. Definitive comparisons of the efficiency of the designs must await the availability of data from PM-only arrays.

### MBEI reduces variability for low expression estimates

The array set 5 has 29 pairs of arrays [2]. Each pair consists of two arrays hybridizing to samples replicated at total mRNA level (the total mRNA sample is split and then amplified and labeled separately, and hybridized to two different arrays). The differences between the expression values of the two replicate arrays in a pair are due to the variation introduced in experimental steps after the split, the array manufacturing difference and analytical methods such as normalization and expression calculation. This difference provides a lower bound of biological variation that can be detected between two independently amplified samples, and serves as a good statistic for comparing different analytical methods.

### Confidence interval for fold change

After obtaining expression indexes using AD or MBEI, fold changes can be calculated between two arrays for every gene and used to identify differentially expressed genes. Usually, low or negative expressions are truncated to a small number before calculating fold changes, and GeneChip also cautions against using fold changes when the baseline expression is absent.

The availability of SEs for the model-based expression indexes allows us to obtain confidence intervals for fold changes. Suppose

where θ_{1} and θ_{2} are the real expression levels in the sample, and
_{1} and
_{2} are the model-based estimates of expression levels. We substitute the model-based SEs for δ_{1} and δ_{2} Letting *r* = θ_{1}*/θ*_{2} be the real fold change, then inference on r can be based on the quantity

It can be shown that Q has a Χ^{2} distribution with 1 degree of freedom irrespective of the values of θ_{1} and θ_{2} [4]. Thus Q is a pivotal quantity involving r. We can use Q to construct fixed-level tests and to invert them to obtain confidence intervals (CI) for fold changes [5].

Using expression levels and associated SEs to determine confidence intervals of fold changes

Expression 1 | Std Error 1 | Expression 2 | Std Error 2 | Fold Change | Lower CB | Upper CB | |
---|---|---|---|---|---|---|---|

Gene 1 | 859.635 | 41.7808 | 347.57 | 36.0887 | 2.47327 | 2.06844 | 3.02672 |

Gene 2 | 405.72 | 31.2305 | 164.014 | 44.2505 | 2.47369 | 1.66938 | 4.49127 |

Gene 3 | 283.931 | 28.5281 | 114.705 | 18.4661 | 2.47531 | 1.83926 | 3.48466 |

Gene 4 | 45.9821 | 64.2419 | 18.5727 | 84.5308 | 2.47579 | 0 | Infinity |

Gene 5 | 225.178 | 57.489 | 90.9045 | 36.1766 | 2.47709 | 1.18104 | 7.48749 |

Gene 6 | 247.002 | 50.6518 | 99.6642 | 19.5384 | 2.47834 | 1.51079 | 4.0211 |

Gene 7 | 49.9739 | 21.5345 | 20.1514 | 23.5651 | 2.47992 | 0.487603 | Infinity |

Gene 8 | 276.491 | 18.6883 | 111.373 | 36.1004 | 2.48256 | 1.59069 | 5.34635 |

Gene 9 | 436.071 | 32.9779 | 175.384 | 21.0669 | 2.48638 | 1.98665 | 3.18811 |

Gene 10 | 75.6914 | 17.7215 | 30.4395 | 17.9707 | 2.48662 | 1.07209 | 86.1656 |

Gene 11 | 80.673 | 25.3085 | 32.4314 | 16.9626 | 2.4875 | 0.960787 | 18.1833 |

Gene 12 | 181.528 | 42.4837 | 72.8751 | 28.1787 | 2.49094 | 1.24668 | 7.11945 |

Gene 13 | 1122.28 | 99.2835 | 449.889 | 63.2821 | 2.49456 | 1.92075 | 3.35055 |

Gene 14 | 168.234 | 40.629 | 67.4387 | 30.2982 | 2.49462 | 1.17639 | 9.81547 |

In practice, we find it useful to sort genes by the lower confidence bound ('Lower CB' in Table 1), which is a conservative estimate of the fold change. When an expression index is negative (as a result of taking PM/MM differences), we do not calculate the confidence intervals. In such a case, it is more helpful to filter genes by presence calls.

### Standard errors help to assess clustering results

Cluster analysis is a popular method for analyzing the data of a series of microarrays [6,7]. If two genes are co-regulated at the transcription level, their expression values across samples are likely to be correlated. Clustering algorithms use these correlations (or monotone transformation of correlations) to cluster co-regulated genes together. The correlation based on the estimated expression levels may, however, be different from that based on the real but unobserved expression levels. Also, the commonly used hierarchical clustering algorithm is an irreversible process: once two genes or nodes are merged, they will stay together, even if later on there is good reason to adjust previous clustering. Thus there is a need to assess the reliability of clusters.

A global way of using SE in hierarchical clustering is to resample or bootstrap [8] the whole 'gene by sample' data matrix and redo the clustering, then investigate the overall properties emerging from this repertoire of clustering trees. In Bittner *et al.* [9], the data matrix coming from cDNA microarray experiments is resampled using the estimated variation derived from the median SD of log ratios for a gene across samples. As we now have SEs for all data points, we can resample each expression value from a normal distribution with mean equal to the estimated expression value and SD equal to the attached SE.

We can see from Figure 8c that most genes in the original cluster are reliable members, whereas a few genes at the bottom of the cluster are not (in fact they are merged into the original cluster last). Interestingly, some genes originally not in the original cluster group with the 'corresponding clusters' during resampling many times and have gray 'in-cluster' marks. These genes may be related to the original cluster in some way. In summary, this method can help us to distinguish reliable and unreliable gene members of a cluster, as well as draw our attention to related genes originally clustered somewhere else because of the accidental nature of hierarchical clustering.

## Methods and materials

### Software

We have developed a software package DNA-Chip Analyzer (dChip [10]) to perform invariant-set normalization (see below), calculation of MBEI [1], computation of confidence intervals of fold changes, and hierarchical clustering with resampling.

Our experience is that more than 10 arrays are appropriate for model training, outlier detection and MBEI calculation. Researchers with fewer than 10 arrays may seek arrays of the same chip type and hybridizing to similar tissue samples, and combine them in a single dChip analysis session. We are exploring model-based meta-analysis of many arrays of the same chip type but hybridizing to a heterogeneous set of tissues samples, and will present such analysis in future work.

### Normalization of arrays based on an 'invariant set'

A normalization relation can be understood as a curve in the scatterplot of two arrays with the baseline array drawn on the *y*-axis and the array to be normalized on the x-axis. A straight line running through the origin is a multiplicative normalization method (GeneChip's scaling method), and a smoothing spline through the scatterplot can also be used (Figure 9a, also see [11]).

*n =*140,000) is small enough, it is kept for the new set. Here the threshold of being small is PRD < 0.003 when a points's average intensity ranks in the two arrays is small and PRD < 0.007 when it is large, accounting for fewer points at high-intensity range; and the threshold is interpolated in between. We chose these parameters empirically to make the selected points in the invariant set thin enough to naturally determine a normalization relation. In this way we may obtain a new set of 10,000 points, and the same procedure is applied to the new set iteratively, until the number of points in the new set does not decrease anymore. A piecewise linear running median line is then calculated and used as the normalization curve. After normalization, the two arrays have similar overall brightness. (Figure 9c). Figure 10 shows another pair of arrays where the normalization relationship is non-linear.

## Declarations

### Acknowledgements

We thank Sven de Vos, Dan Tang, Nik Brown, Stan Nelson, Jae K. Lee, Yaron Hakak, John Walker and Arindam Bhattacharjee for providing data, and the editor and referees who provided valuable suggestions. This work is supported in part by NIH grant 1 RO1 HG02341-01 and NSF grant DBI-9904701.

## Authors’ Affiliations

## References

- Li C, Wong WH: Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci USA. 2001, 98: 31-36. 10.1073/pnas.011404098.PubMedPubMed CentralView ArticleGoogle Scholar
- Hakak Y, Walker JR, Li C, Wong WH, Davis KL, Buxbaum JD, Haroutunian V, Fienberg AA: Genome-wide expression analysis reveals dysregulation of myelination-related genes in chronic schizophrenia. Proc Natl Acad Sci USA. 2001, 98: 4746-4751. 10.1073/pnas.081071198.PubMedPubMed CentralView ArticleGoogle Scholar
- Wodicka L, Dong H, Mittmann M, Ho M, Lockhart D: Genome-wide expression monitoring in
*Saccharomyces cerevisiae*. Nat Biotechnol. 1997, 15: 1359-1367.PubMedView ArticleGoogle Scholar - Wallace D: The Behrens-Fisher and Fieller-Creasy problems. In Lecture Notes in Statistics 1, R.A.Fisher: An Appreciation. Edited by Fienberg SE, Hinkley DV. Springer-Verlag. 1988, 119-147.Google Scholar
- Cox DR, Hinkley DV: Theoretical Statistics. London: Chapman and Hall,. 1974View ArticleGoogle Scholar
- Eisen M, Spellman P, Brown P, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998, 95: 14863-14868. 10.1073/pnas.95.25.14863.PubMedPubMed CentralView ArticleGoogle Scholar
- Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander E, Golub T: Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA. 1999, 96: 2907-2912. 10.1073/pnas.96.6.2907.PubMedPubMed CentralView ArticleGoogle Scholar
- Efron B, Tibshirani R: An Introduction to the Bootstrap. New York: Chapman & Hall/CRC,. 1993View ArticleGoogle Scholar
- Bittner M, Meltzer P, Chen Y, Jiang Y, Seftor E, Hendrix M, Rad-macher M, Simon R, Yakhinik Z, Ben-Dork A, et al: Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature. 2000, 406: 536-540. 10.1038/35020115.PubMedView ArticleGoogle Scholar
- DNA-Chip Analyzer. [http://www.dchip.org]
- Schadt EE, Li C, Su C, Wong WH: Analyzing high-density oligonucleotide gene expression array data. J Cell Biochem. 2001, 80: 192-202. 10.1002/1097-4644(20010201)80:2<192::AID-JCB50>3.0.CO;2-W.View ArticleGoogle Scholar